1. Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

To give an example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers they receive; in other words, a user can receive an offer, never actually view it, and still complete it. For example, a user might receive the "spend 10 dollars, get 2 dollars off" offer but never open it during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer-completion record in the data set; however, the customer was not influenced by the offer, because the customer never viewed it.
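The influence rule described above can be sketched as a small predicate. This is an illustrative sketch only; the function name, arguments, and time conventions (times in hours, durations in days, as described later for the transcript data) are assumptions, not the project's final implementation.

```python
# Illustrative sketch of the influence rule: an offer only counts as
# influential if the customer viewed it before completing it, and both
# events fall within the offer's validity window.
def is_influenced(received_time, viewed_time, completed_time, duration_days):
    expiration = received_time + duration_days * 24  # hours
    if viewed_time is None or completed_time is None:
        return False  # never viewed, or never completed
    return received_time <= viewed_time <= completed_time <= expiration

# The scenario above: the customer completed the offer but never viewed it
print(is_influenced(0, None, 120, 10))  # False
# Viewed at hour 24, completed at hour 120, within the 240-hour window
print(is_influenced(0, 24, 120, 10))    # True
```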

2. Methodology

Processes and methodologies act as the skeleton on which successful projects are built.

The cross-industry standard process for data mining (CRISP-DM) methodology is an open standard process model that describes common approaches used by data mining experts. In this project, we utilize the CRISP-DM methodology.

CRISP-DM breaks down into six phases.

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

<img src="CRISP-DM_Process_Diagram.png" title = "CRISP-DM" width= 400 height = 400 alt="By Kenneth Jensen - Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610"/>

3. Business Understanding

This phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan.

For the current scenario, we are going to:

  • perform exploratory data analysis on:
    • univariate frequency distributions by:
      • age
      • gender
      • income
      • membership join date
      • events
    • multivariate frequency distributions by:
      • events by gender
      • events by income
      • gender by income
  • build a machine learning model that predicts whether or not someone will respond to an offer.
  • build a machine learning model that predicts purchasing habits.
  • build a machine learning model that predicts the best offer for an individual.

4. Data Understanding

There are two ways in which the Data Understanding phase is practiced:

  1. Start with an initial data collection and proceed with activities to get familiar with the data, discover first insights, or detect interesting subsets that form hypotheses about hidden information.

  2. Recognize specific interesting questions and then collect data related to those questions.

The transition from the Business Understanding to the Data Understanding phase is not linear; instead, it is cyclic.

In the current project, we are going to use only the data provided by Starbucks, as it is challenging to work with inherent limitations in data. We are therefore practicing the first method.

The data is contained in three files:

  • portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json - demographic data for each customer
  • transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) - offer id
  • offer_type (string) - type of offer ie BOGO, discount, informational
  • difficulty (int) - minimum required spend to complete an offer
  • reward (int) - reward given for completing an offer
  • duration (int) - time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) - age of the customer
  • became_member_on (int) - date when customer created an app account
  • gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
  • id (str) - customer id
  • income (float) - customer's income

transcript.json

  • event (str) - record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) - customer id
  • time (int) - time in hours since start of test. The data begins at time t=0
  • value - (dict of strings) - either an offer id or transaction amount depending on the record
In [1]:
import pandas as pd
import numpy as np
import math
import json

!pip install joblib
from joblib import dump, load

from sklearn.pipeline import Pipeline, make_union

from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.metrics import make_scorer,r2_score, mean_squared_error, f1_score, classification_report, accuracy_score

from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, RandomForestClassifier, RandomForestRegressor, \
GradientBoostingClassifier, GradientBoostingRegressor, ExtraTreesClassifier, ExtraTreesRegressor

from sklearn.multioutput import MultiOutputClassifier

import plotly.plotly as py
import plotly.graph_objs as go

from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)

import gc

import warnings
warnings.filterwarnings('ignore')
Requirement already satisfied: joblib in /anaconda3/lib/python3.6/site-packages (0.13.2)
In [2]:
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

5. Data Preparation

The data preparation phase covers all activities needed to construct the final dataset from the initial raw data. Data preparation is commonly said to consume around 80% of the effort in a data project.

Data wrangling is the core activity in this phase. There is no single way to perform data wrangling; as a rule of thumb, we will approach it in two steps:

  1. Assessing - Identifying issues in data and potential fixes.
  2. Cleaning - Implementing possible fixes and running unit tests to validate the fixes.
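As a minimal illustration of this assess-then-clean loop (on hypothetical toy data; the notebook's real assessments follow below):

```python
import pandas as pd

# Toy frame with a missing-value issue, mimicking the assess/clean cycle
df = pd.DataFrame({'age': [25, None, 40],
                   'income': [50000.0, 60000.0, None]})

# Assess: identify the issue programmatically
assert df.isnull().any().any()          # missing data is present

# Clean: implement a possible fix
df_clean = df.dropna().reset_index(drop=True)

# Test: a small unit test validating the fix
assert not df_clean.isnull().any().any()
assert df_clean.shape == (1, 2)
```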

5.1. Data Wrangling

Data Wrangling is part of Data Understanding and Data Preparation phases of the CRISP-DM model and is the first programming step.

Data Wrangling is language and framework independent, and there is no one right way. In our case, we are using Python as the programming language of choice and Pandas as the data manipulation framework.

I am going to divide Data Wrangling into three steps:

  1. Gather Data - In this step we will collect and load necessary data into the data processing framework.
  2. Assess Data - In this step we are going to perform the syntactical and semantical check on the data and identify any issues in the data along with potential fixes.
  3. Clean Data - In this step we will implement the data fixes from the Assessment phase. We will also run small unit tests to make sure the data repairs are working as expected.

Data Wrangling is a cyclic process, and often we need to revisit the steps again and again.

We will perform the Data Wrangling on all three of the data sources provided by Starbucks.

5.1.1. Portfolio

Assess Data

In [3]:
portfolio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
channels      10 non-null object
difficulty    10 non-null int64
duration      10 non-null int64
id            10 non-null object
offer_type    10 non-null object
reward        10 non-null int64
dtypes: int64(3), object(3)
memory usage: 560.0+ bytes
In [4]:
portfolio.isnull().any()
Out[4]:
channels      False
difficulty    False
duration      False
id            False
offer_type    False
reward        False
dtype: bool
In [5]:
portfolio.head(10)
Out[5]:
channels difficulty duration id offer_type reward
0 [email, mobile, social] 10 7 ae264e3637204a6fb9bb56bc8210ddfd bogo 10
1 [web, email, mobile, social] 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 bogo 10
2 [web, email, mobile] 0 4 3f207df678b143eea3cee63160fa8bed informational 0
3 [web, email, mobile] 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 bogo 5
4 [web, email] 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 discount 5
5 [web, email, mobile, social] 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 discount 3
6 [web, email, mobile, social] 10 10 fafdcd668e3743c1bb461111dcafc2a4 discount 2
7 [email, mobile, social] 0 3 5a8bc65990b245e5a138643cd4eb9837 informational 0
8 [web, email, mobile, social] 5 5 f19421c1d4aa40978ebb69ca19b0e20d bogo 5
9 [web, email, mobile] 10 7 2906b810c7d4411798c6938adc9daaa5 discount 2

From the visual and programmatic assessment, the Portfolio data set has only ten rows, with no missing data.

However, the data is not in a machine-learning-friendly structure. We are going to apply one-hot encoding to the channels and offer_type columns.

Clean Data

1. Create dummies for offer_type column

Code

In [6]:
portfolio_for_ml = pd.get_dummies(portfolio,columns=['offer_type'])
In [7]:
portfolio_for_ml.rename(columns={'offer_type_bogo':'bogo',
                                 'offer_type_discount':'discount',
                                 'offer_type_informational':'informational'},
                        inplace=True)

Test

In [8]:
portfolio_for_ml.head(10)
Out[8]:
channels difficulty duration id reward bogo discount informational
0 [email, mobile, social] 10 7 ae264e3637204a6fb9bb56bc8210ddfd 10 1 0 0
1 [web, email, mobile, social] 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 10 1 0 0
2 [web, email, mobile] 0 4 3f207df678b143eea3cee63160fa8bed 0 0 0 1
3 [web, email, mobile] 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 5 1 0 0
4 [web, email] 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 5 0 1 0
5 [web, email, mobile, social] 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 3 0 1 0
6 [web, email, mobile, social] 10 10 fafdcd668e3743c1bb461111dcafc2a4 2 0 1 0
7 [email, mobile, social] 0 3 5a8bc65990b245e5a138643cd4eb9837 0 0 0 1
8 [web, email, mobile, social] 5 5 f19421c1d4aa40978ebb69ca19b0e20d 5 1 0 0
9 [web, email, mobile] 10 7 2906b810c7d4411798c6938adc9daaa5 2 0 1 0

2. Split the channels column into email, mobile, social and web columns

Code

In [9]:
def channels_email(data):
    if 'email' in data:
        return 1
    else:
        return 0
    
def channels_mobile(data):
    if 'mobile' in data:
        return 1
    else:
        return 0
    
def channels_social(data):
    if 'social' in data:
        return 1
    else:
        return 0
    
def channels_web(data):
    if 'web' in data:
        return 1
    else:
        return 0
In [10]:
portfolio_for_ml['email'] = portfolio_for_ml.channels.apply(lambda x: channels_email(x))
portfolio_for_ml['mobile'] = portfolio_for_ml.channels.apply(lambda x: channels_mobile(x))
portfolio_for_ml['social'] = portfolio_for_ml.channels.apply(lambda x: channels_social(x))
portfolio_for_ml['web'] = portfolio_for_ml.channels.apply(lambda x: channels_web(x))
In [11]:
portfolio_for_ml.drop(columns=['channels'],inplace=True)

Test

In [12]:
portfolio_for_ml.head(10)
Out[12]:
difficulty duration id reward bogo discount informational email mobile social web
0 10 7 ae264e3637204a6fb9bb56bc8210ddfd 10 1 0 0 1 1 1 0
1 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 10 1 0 0 1 1 1 1
2 0 4 3f207df678b143eea3cee63160fa8bed 0 0 0 1 1 1 0 1
3 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 5 1 0 0 1 1 0 1
4 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 5 0 1 0 1 0 0 1
5 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 3 0 1 0 1 1 1 1
6 10 10 fafdcd668e3743c1bb461111dcafc2a4 2 0 1 0 1 1 1 1
7 0 3 5a8bc65990b245e5a138643cd4eb9837 0 0 0 1 1 1 1 0
8 5 5 f19421c1d4aa40978ebb69ca19b0e20d 5 1 0 0 1 1 1 1
9 10 7 2906b810c7d4411798c6938adc9daaa5 2 0 1 0 1 1 0 1

3. Map each offer to a numeric offer_code instead of using the long id (offer ID) string

Code

In [13]:
portfolio_for_ml['offer_code'] = (portfolio_for_ml.index.values+1)

Test

In [14]:
portfolio_for_ml.head(10)
Out[14]:
difficulty duration id reward bogo discount informational email mobile social web offer_code
0 10 7 ae264e3637204a6fb9bb56bc8210ddfd 10 1 0 0 1 1 1 0 1
1 10 5 4d5c57ea9a6940dd891ad53e9dbe8da0 10 1 0 0 1 1 1 1 2
2 0 4 3f207df678b143eea3cee63160fa8bed 0 0 0 1 1 1 0 1 3
3 5 7 9b98b8c7a33c4b65b9aebfe6a799e6d9 5 1 0 0 1 1 0 1 4
4 20 10 0b1e1539f2cc45b7b9fa7c272da2e1d7 5 0 1 0 1 0 0 1 5
5 7 7 2298d6c36e964ae4a3e7e9706d1fb8c2 3 0 1 0 1 1 1 1 6
6 10 10 fafdcd668e3743c1bb461111dcafc2a4 2 0 1 0 1 1 1 1 7
7 0 3 5a8bc65990b245e5a138643cd4eb9837 0 0 0 1 1 1 1 0 8
8 5 5 f19421c1d4aa40978ebb69ca19b0e20d 5 1 0 0 1 1 1 1 9
9 10 7 2906b810c7d4411798c6938adc9daaa5 2 0 1 0 1 1 0 1 10

We will consolidate all the cleaning steps into a single function.

Code

In [15]:
def generate_portfolio_for_ml(portfolio = portfolio.copy()):
    portfolio_for_ml = pd.get_dummies(portfolio,columns=['offer_type'])
    
    portfolio_for_ml.rename(columns={'offer_type_bogo':'bogo',
                                     'offer_type_discount':'discount',
                                     'offer_type_informational':'informational'},
                            inplace=True)
    
    portfolio_for_ml['email'] = portfolio_for_ml.channels.apply(lambda x: channels_email(x))
    portfolio_for_ml['mobile'] = portfolio_for_ml.channels.apply(lambda x: channels_mobile(x))
    portfolio_for_ml['social'] = portfolio_for_ml.channels.apply(lambda x: channels_social(x))
    portfolio_for_ml['web'] = portfolio_for_ml.channels.apply(lambda x: channels_web(x))
    
    portfolio_for_ml.drop(columns=['channels'],inplace=True)
    portfolio_for_ml['offer_code'] = (portfolio_for_ml.index.values+1)
    
    return portfolio_for_ml

Test

In [16]:
portfolio_for_ml_1 = generate_portfolio_for_ml()

portfolio_for_ml.equals(portfolio_for_ml_1)
Out[16]:
True
In [17]:
del portfolio_for_ml_1

gc.collect()
Out[17]:
402
In [18]:
portfolio_for_ml.to_csv('portfolio_for_ml.csv',index=False)

5.1.2. Profile

Assess Data

In [19]:
profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
age                 17000 non-null int64
became_member_on    17000 non-null int64
gender              14825 non-null object
id                  17000 non-null object
income              14825 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.1+ KB
In [20]:
profile.isnull().any()
Out[20]:
age                 False
became_member_on    False
gender               True
id                  False
income               True
dtype: bool
In [21]:
def plot_age(profile = profile):
    trace = go.Histogram(x=profile.age.values,
                         name='Age',
                         marker=dict(color='rgba(95,158,209,1)',))

    layout = go.Layout(title = 'Age Distribution',
                      xaxis=dict(title='Age'))

    fig = go.Figure(data=go.Data([trace]), layout=layout)
    iplot(fig)    
In [22]:
def plot_gender(profile = profile):
    trace = go.Histogram(x=profile.gender.values,
                         name='Gender',
                         marker=dict(color='rgba(95,158,209,1)',))

    layout = go.Layout(title = 'Gender Distribution',
                      xaxis=dict(title='Gender'))

    fig = go.Figure(data=go.Data([trace]), layout=layout)
    iplot(fig)
In [23]:
def plot_income(profile = profile):
    trace = go.Histogram(x=profile.income.values,
                         name='Income',
                         marker=dict(color='rgba(95,158,209,1)',))

    layout = go.Layout(title = 'Income Distribution',
                      xaxis=dict(title='Income'))

    fig = go.Figure(data=go.Data([trace]), layout=layout)
    iplot(fig)
In [24]:
def plot_join_date(profile = profile):
    trace = go.Histogram(x=profile.became_member_on,
                         name='Became Member on',
                         marker=dict(color='rgba(95,158,209,1)',))

    layout = go.Layout(title = 'Membership Join Date Distribution',
                      xaxis=dict(title='Membership Join Date'))

    fig = go.Figure(data=go.Data([trace]), layout=layout)
    iplot(fig)
In [25]:
plot_age()

plot_gender()

plot_income()
In [26]:
(profile.age==118).sum(),profile.gender.isnull().sum(),profile.income.isnull().sum()
Out[26]:
(2175, 2175, 2175)
In [27]:
profile[(profile.age==118) & (profile.gender.isnull()) & (profile.income.isnull())].shape
Out[27]:
(2175, 5)
In [28]:
profile[(profile.age==118) & (profile.gender.isnull()) & (profile.income.isnull())].shape[0]/ profile.shape[0]
Out[28]:
0.12794117647058822

From the visual assessment of the Profile data set:

  1. the became_member_on column is not in DateTime format.
  2. if the age information is missing, the age is populated by default with 118.

From the programmatic assessment of the Profile data set:

  1. the gender and income columns have missing data.
  2. the rows with missing gender and income information also have the age value 118.

The following fixes will be implemented on the Profile data set in the clean phase:

  1. Drop rows with missing values, which should implicitly drop the rows with age 118.
  2. Convert became_member_on to the Pandas DateTime datatype.

The data is not in a machine-learning-friendly structure. We will create a new ML-friendly Pandas data frame with the following changes:

  1. apply one-hot encoding to the gender column.
  2. split the became_member_on column into year, month and date columns, and drop became_member_on.

Clean Data

1. Drop rows with missing values, which should implicitly drop the rows with age 118.

Code

In [29]:
profile.dropna(inplace=True)
profile.reset_index(inplace=True,drop=True)

Test

In [30]:
(profile.age==118).sum(),profile.gender.isnull().sum(),profile.income.isnull().sum()
Out[30]:
(0, 0, 0)
In [31]:
profile[(profile.age==118) & (profile.gender.isnull()) & (profile.income.isnull())].shape
Out[31]:
(0, 5)
In [32]:
plot_age()

plot_gender()

plot_income()

2. Convert became_member_on to Pandas DateTime datatype.

Code

In [33]:
profile['became_member_on'] = pd.to_datetime(profile.became_member_on,format='%Y%m%d')

Test

In [34]:
profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14825 entries, 0 to 14824
Data columns (total 5 columns):
age                 14825 non-null int64
became_member_on    14825 non-null datetime64[ns]
gender              14825 non-null object
id                  14825 non-null object
income              14825 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 579.2+ KB
In [35]:
profile.head()
Out[35]:
age became_member_on gender id income
0 55 2017-07-15 F 0610b486422d4921ae7d2bf64640c50b 112000.0
1 75 2017-05-09 F 78afa995795e4d85b5d9ceeca43f5fef 100000.0
2 68 2018-04-26 M e2127556f4f64592b11af22de27a7932 70000.0
3 65 2018-02-09 M 389bc3fa690240e798340f5a15918d5c 53000.0
4 58 2017-11-11 M 2eeac8d8feae4a8cad5a6af0499a211d 51000.0
In [36]:
plot_join_date()

3. Create a machine-learning-friendly profile.

Code

In [37]:
profile_for_ml = profile.copy()
In [38]:
profile_for_ml = pd.get_dummies(profile_for_ml,columns=['gender'])
In [39]:
profile_for_ml['became_member_on_year'] = profile_for_ml.became_member_on.dt.year
profile_for_ml['became_member_on_month'] = profile_for_ml.became_member_on.dt.month
profile_for_ml['became_member_on_date'] = profile_for_ml.became_member_on.dt.day

profile_for_ml.drop(columns=['became_member_on'], inplace=True)

Test

In [40]:
profile_for_ml.head()
Out[40]:
age id income gender_F gender_M gender_O became_member_on_year became_member_on_month became_member_on_date
0 55 0610b486422d4921ae7d2bf64640c50b 112000.0 1 0 0 2017 7 15
1 75 78afa995795e4d85b5d9ceeca43f5fef 100000.0 1 0 0 2017 5 9
2 68 e2127556f4f64592b11af22de27a7932 70000.0 0 1 0 2018 4 26
3 65 389bc3fa690240e798340f5a15918d5c 53000.0 0 1 0 2018 2 9
4 58 2eeac8d8feae4a8cad5a6af0499a211d 51000.0 0 1 0 2017 11 11

We will consolidate all the cleaning steps into a single function.

In [41]:
def clean_profile(profile = profile.copy()):
    profile.dropna(inplace=True)
    profile['became_member_on'] = pd.to_datetime(profile.became_member_on,format='%Y%m%d')
    
    return profile
In [42]:
def generate_profile_for_ml(profile = clean_profile(profile= profile.copy())):
    profile_for_ml = profile.copy()
    
    profile_for_ml = pd.get_dummies(profile_for_ml,columns=['gender'])
    
    profile_for_ml['became_member_on_year'] = profile_for_ml.became_member_on.dt.year
    profile_for_ml['became_member_on_month'] = profile_for_ml.became_member_on.dt.month
    profile_for_ml['became_member_on_date'] = profile_for_ml.became_member_on.dt.day

    profile_for_ml.drop(columns=['became_member_on'], inplace=True)
    
    return profile_for_ml

Test

In [43]:
profile_for_ml_1 = generate_profile_for_ml(clean_profile())

profile_for_ml.equals(profile_for_ml_1)
Out[43]:
True
In [44]:
profile_for_ml_1 = generate_profile_for_ml()

profile_for_ml.equals(profile_for_ml_1)
Out[44]:
True
In [45]:
del profile_for_ml_1
gc.collect()
Out[45]:
28455
In [46]:
profile_for_ml.to_csv('profile_for_ml.csv',index=False)
5.1.3. Transcript

Assess Data

In [47]:
transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
event     306534 non-null object
person    306534 non-null object
time      306534 non-null int64
value     306534 non-null object
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
In [48]:
transcript.isnull().any()
Out[48]:
event     False
person    False
time      False
value     False
dtype: bool
In [49]:
transcript.head()
Out[49]:
event person time value
0 offer received 78afa995795e4d85b5d9ceeca43f5fef 0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'}
1 offer received a03223e636434f42ac4c3df47e8bac43 0 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'}
2 offer received e2127556f4f64592b11af22de27a7932 0 {'offer id': '2906b810c7d4411798c6938adc9daaa5'}
3 offer received 8ec6ce2a7e7949b1bf142def7d0e0586 0 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'}
4 offer received 68617ca6246f4fbc85e91a2a49552598 0 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'}
In [50]:
# Uncomment below code if you want to print the value_counts
#transcript.value.value_counts()
In [51]:
def plot_events(transcript = transcript):
    trace = go.Histogram(x=transcript.event.values,
                         name='Event',
                         marker=dict(color='rgba(95,158,209,1)',))

    layout = go.Layout(title = 'Event Distribution',
                       xaxis=dict(title='Events'))

    fig = go.Figure(data=go.Data([trace]), layout=layout)
    iplot(fig)
In [52]:
plot_events()

From a visual and programmatic assessment, there are no data issues in the Transcript data set.

However, whether a promotion influenced the user is not captured in the data. A user is deemed to be influenced by a promotion only if the individual made a transaction after viewing the advertisement.

In the cleaning step:

  1. Create dummy columns out of the event column.
  2. The value column is a composite column containing Offer ID, Reward and Amount information. We will extract this information into individual columns.
  3. When an individual has used an offer, two transaction records are created: one for claiming the reward and one for making the purchase. We are going to consolidate these two records into one.
  4. Each offer is valid only for a certain number of days once received; the current data frame does not carry this information. For successful completion, the offer must be used before it expires.
  5. We will use the time column to create new columns: offer_received_time, offer_viewed_time, offer_completed_time.
  6. A person can receive the same offer multiple times. To consolidate the transaction records that fall within an offer's validity window, we will create a new column, offerid_expiration, and use it to group the transactions.
  7. Consolidate the transaction records that fall within the offer's validity window.
  8. We still have separate records for viewing and completing. We will remove these rows, as this information has already been captured in the offer-received record.
  9. When the transactions were consolidated, for purchases made without a coupon the amount_y column was populated with the maximum amount spent by the person. We need to correct this.
  10. For regular transactions, the expiration column is still populated. We will fill it with 0.
  11. A user is deemed to be influenced by a promotion only if the individual made a transaction after viewing the advertisement. We will create a new column capturing whether the promotion influenced the individual.
  12. Create a new column to capture the transaction time.
  13. When the transactions were consolidated, we lost the offer_received, offer_viewed and offer_completed flags. We need to repopulate them with the correct values.

Clean Data

In [53]:
def get_offer_id(data):
    try:
        return data['offer id']
    except KeyError:
        try:
            return data['offer_id']
        except KeyError:
            return ''
In [54]:
def get_reward(data):
    try:
        return data['reward']
    except KeyError:
        return 0
In [55]:
def get_amount(data):
    try:
        return data['amount']
    except KeyError:
        return 0
In [56]:
def get_duration(offer_id):
    if offer_id.strip() != '':
        return portfolio[portfolio.id == offer_id]['duration'].values[0]
    else:
        return 0
In [57]:
transcript_clone = transcript.copy()

1. Create dummy columns out of the event column

Code

In [58]:
transcript_clone =pd.get_dummies(transcript_clone,columns=['event'])
In [59]:
transcript_clone.rename(columns={'event_offer completed':'offer_completed',
                                 'event_offer received':'offer_received',
                                 'event_offer viewed':'offer_viewed',
                                 'event_transaction':'transaction'},
                        inplace=True)

Test

In [60]:
transcript_clone.head()
Out[60]:
person time value offer_completed offer_received offer_viewed transaction
0 78afa995795e4d85b5d9ceeca43f5fef 0 {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0 1 0 0
1 a03223e636434f42ac4c3df47e8bac43 0 {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0 1 0 0
2 e2127556f4f64592b11af22de27a7932 0 {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0 1 0 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 0 {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0 1 0 0
4 68617ca6246f4fbc85e91a2a49552598 0 {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0 1 0 0

2. "value" column is a composite column that contains Offer ID, Reward and Amount information. We will extract information into individual columns

Code

In [61]:
transcript_clone['offer_id'] = transcript_clone.value.apply(get_offer_id)
transcript_clone['reward'] = transcript_clone.value.apply(get_reward)
transcript_clone['amount'] = transcript_clone.value.apply(get_amount)

transcript_clone.drop(columns=['value'],inplace=True)
In [62]:
transcript_clone = transcript_clone[['person','time','offer_id','offer_received','offer_viewed','offer_completed',
                                     'transaction','reward','amount']]

Test

In [63]:
transcript_clone.head()
Out[63]:
person time offer_id offer_received offer_viewed offer_completed transaction reward amount
0 78afa995795e4d85b5d9ceeca43f5fef 0 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 0 0 0 0 0.0
1 a03223e636434f42ac4c3df47e8bac43 0 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 0 0 0.0
2 e2127556f4f64592b11af22de27a7932 0 2906b810c7d4411798c6938adc9daaa5 1 0 0 0 0 0.0
3 8ec6ce2a7e7949b1bf142def7d0e0586 0 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.0
4 68617ca6246f4fbc85e91a2a49552598 0 4d5c57ea9a6940dd891ad53e9dbe8da0 1 0 0 0 0 0.0

3. When an individual has used an offer, two transaction records are created: one for claiming the reward and one for making the purchase. We are going to consolidate these two records into one.

Looking at the output below, we can see that for person 0009655768c64bdeb2e877511632db8f, at time 414, there are two records: one for the purchase and one for claiming the reward.

Once the data frame is cleaned, this should no longer be the case.

In [64]:
transcript_clone.sort_values(['person','time']).head(20)
Out[64]:
person time offer_id offer_received offer_viewed offer_completed transaction reward amount
55972 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00
77705 0009655768c64bdeb2e877511632db8f 192 5a8bc65990b245e5a138643cd4eb9837 0 1 0 0 0 0.00
89291 0009655768c64bdeb2e877511632db8f 228 0 0 0 1 0 22.16
113605 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00
139992 0009655768c64bdeb2e877511632db8f 372 3f207df678b143eea3cee63160fa8bed 0 1 0 0 0 0.00
153401 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00
168412 0009655768c64bdeb2e877511632db8f 414 0 0 0 1 0 8.57
168413 0009655768c64bdeb2e877511632db8f 414 f19421c1d4aa40978ebb69ca19b0e20d 0 0 1 0 5 0.00
187554 0009655768c64bdeb2e877511632db8f 456 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00
204340 0009655768c64bdeb2e877511632db8f 504 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00
228422 0009655768c64bdeb2e877511632db8f 528 0 0 0 1 0 14.11
228423 0009655768c64bdeb2e877511632db8f 528 fafdcd668e3743c1bb461111dcafc2a4 0 0 1 0 2 0.00
233413 0009655768c64bdeb2e877511632db8f 540 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 0 0 0.00
237784 0009655768c64bdeb2e877511632db8f 552 0 0 0 1 0 13.56
247879 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 1 0 0 0 0 0.00
258883 0009655768c64bdeb2e877511632db8f 576 0 0 0 1 0 10.27
258884 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 0 0 1 0 2 0.00
293497 0009655768c64bdeb2e877511632db8f 660 0 0 0 1 0 12.36
300930 0009655768c64bdeb2e877511632db8f 690 0 0 0 1 0 28.16
302205 0009655768c64bdeb2e877511632db8f 696 0 0 0 1 0 18.41

Code

In [65]:
transcript_clean = transcript_clone.groupby(['person','time'],as_index=False).agg('max')

Test

In [66]:
transcript_clean.sort_values(['person','time']).head(20)
Out[66]:
person time offer_id offer_received offer_viewed offer_completed transaction reward amount
0 0009655768c64bdeb2e877511632db8f 168 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00
1 0009655768c64bdeb2e877511632db8f 192 5a8bc65990b245e5a138643cd4eb9837 0 1 0 0 0 0.00
2 0009655768c64bdeb2e877511632db8f 228 0 0 0 1 0 22.16
3 0009655768c64bdeb2e877511632db8f 336 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00
4 0009655768c64bdeb2e877511632db8f 372 3f207df678b143eea3cee63160fa8bed 0 1 0 0 0 0.00
5 0009655768c64bdeb2e877511632db8f 408 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00
6 0009655768c64bdeb2e877511632db8f 414 f19421c1d4aa40978ebb69ca19b0e20d 0 0 1 1 5 8.57
7 0009655768c64bdeb2e877511632db8f 456 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00
8 0009655768c64bdeb2e877511632db8f 504 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00
9 0009655768c64bdeb2e877511632db8f 528 fafdcd668e3743c1bb461111dcafc2a4 0 0 1 1 2 14.11
10 0009655768c64bdeb2e877511632db8f 540 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 0 0 0.00
11 0009655768c64bdeb2e877511632db8f 552 0 0 0 1 0 13.56
12 0009655768c64bdeb2e877511632db8f 576 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 10.27
13 0009655768c64bdeb2e877511632db8f 660 0 0 0 1 0 12.36
14 0009655768c64bdeb2e877511632db8f 690 0 0 0 1 0 28.16
15 0009655768c64bdeb2e877511632db8f 696 0 0 0 1 0 18.41
16 00116118485d4dfda04fdbaba9a87b5c 168 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00
17 00116118485d4dfda04fdbaba9a87b5c 216 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00
18 00116118485d4dfda04fdbaba9a87b5c 294 0 0 0 1 0 0.70
19 00116118485d4dfda04fdbaba9a87b5c 456 0 0 0 1 0 0.20
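The effect of the groupby-max above can be illustrated on a toy frame (hypothetical values, not from the data set): rows that share the same person and time collapse into a single row that keeps the maximum of every flag and amount.

```python
import pandas as pd

# Two records at the same (person, time): an offer-completed flag row and a
# transaction row. groupby(...).max() folds them into a single record.
toy = pd.DataFrame({
    'person': ['a', 'a'],
    'time': [528, 528],
    'offer_completed': [1, 0],
    'transaction': [0, 1],
    'amount': [0.0, 14.11],
})
merged = toy.groupby(['person', 'time'], as_index=False).max()
```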

4. Each offer is valid only for a certain number of days after it is received. The current data frame does not contain this information. For an offer to count as successfully completed, it must be used before it expires.

Code

In [67]:
transcript_clean['duration'] = transcript_clean[transcript_clean.offer_received == 1].offer_id.apply(get_duration)

transcript_clean['duration'] = transcript_clean.duration.fillna(0)

# get_duration returns the validity in days; the time column is in hours.
transcript_clean['duration'] = transcript_clean.duration.apply(lambda x: x * 24)

transcript_clean['expiration'] = transcript_clean.time + transcript_clean.duration

transcript_clean.drop(columns='duration',inplace=True)
    
transcript_clean = transcript_clean[['person', 'time', 'expiration','offer_id', 'offer_received', 'offer_viewed',
                                     'offer_completed', 'transaction', 'reward', 'amount']]

transcript_clean['expiration'] = transcript_clean.expiration.astype(int)

Test

In [68]:
transcript_clean.head(20)
Out[68]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward amount
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00
1 0009655768c64bdeb2e877511632db8f 192 192 5a8bc65990b245e5a138643cd4eb9837 0 1 0 0 0 0.00
2 0009655768c64bdeb2e877511632db8f 228 228 0 0 0 1 0 22.16
3 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00
4 0009655768c64bdeb2e877511632db8f 372 372 3f207df678b143eea3cee63160fa8bed 0 1 0 0 0 0.00
5 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00
6 0009655768c64bdeb2e877511632db8f 414 414 f19421c1d4aa40978ebb69ca19b0e20d 0 0 1 1 5 8.57
7 0009655768c64bdeb2e877511632db8f 456 456 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00
8 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00
9 0009655768c64bdeb2e877511632db8f 528 528 fafdcd668e3743c1bb461111dcafc2a4 0 0 1 1 2 14.11
10 0009655768c64bdeb2e877511632db8f 540 540 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 0 0 0.00
11 0009655768c64bdeb2e877511632db8f 552 552 0 0 0 1 0 13.56
12 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 10.27
13 0009655768c64bdeb2e877511632db8f 660 660 0 0 0 1 0 12.36
14 0009655768c64bdeb2e877511632db8f 690 690 0 0 0 1 0 28.16
15 0009655768c64bdeb2e877511632db8f 696 696 0 0 0 1 0 18.41
16 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00
17 00116118485d4dfda04fdbaba9a87b5c 216 216 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00
18 00116118485d4dfda04fdbaba9a87b5c 294 294 0 0 0 1 0 0.70
19 00116118485d4dfda04fdbaba9a87b5c 456 456 0 0 0 1 0 0.20
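The get_duration helper is defined earlier in the notebook. A minimal stand-in for it, with illustrative validity periods, shows how expiration is derived: validity is recorded in days while time is in hours, so the value is multiplied by 24 before being added.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's get_duration helper: a lookup
# from offer id to validity in days (ids and values here are illustrative).
validity_days = {'offer_a': 3, 'offer_b': 10}

def get_duration(offer_id):
    return validity_days.get(offer_id)

df = pd.DataFrame({'offer_id': ['offer_a', 'offer_b'], 'time': [168, 504]})
# Convert days to hours before adding to the receipt time.
df['expiration'] = df.time + df.offer_id.apply(get_duration) * 24
```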

From the above output, it looks like the rows that are not offer-received events have their expiration set to the transaction timestamp. For rows that carry an offer id, we need to fill in the correct offer expiration time instead.

Code

In [69]:
idx = transcript_clean[transcript_clean.offer_received == 0].index

transcript_clean.loc[idx, 'expiration'] = None

transcript_clean.expiration = transcript_clean.expiration.fillna(value=transcript_clean.time)

transcript_clean['expiration'] = transcript_clean.expiration.astype(int)

idx = transcript_clean[(transcript_clean.offer_id != '') 
                       & (transcript_clean.offer_received == 0)].index

transcript_clean.loc[idx, 'expiration'] = None

transcript_clean.expiration = transcript_clean.expiration.ffill()

transcript_clean['expiration'] = transcript_clean.expiration.astype(int)

Test

In [70]:
transcript_clean.head(20)
Out[70]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward amount
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00
1 0009655768c64bdeb2e877511632db8f 192 240 5a8bc65990b245e5a138643cd4eb9837 0 1 0 0 0 0.00
2 0009655768c64bdeb2e877511632db8f 228 228 0 0 0 1 0 22.16
3 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00
4 0009655768c64bdeb2e877511632db8f 372 432 3f207df678b143eea3cee63160fa8bed 0 1 0 0 0 0.00
5 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00
6 0009655768c64bdeb2e877511632db8f 414 528 f19421c1d4aa40978ebb69ca19b0e20d 0 0 1 1 5 8.57
7 0009655768c64bdeb2e877511632db8f 456 528 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00
8 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00
9 0009655768c64bdeb2e877511632db8f 528 744 fafdcd668e3743c1bb461111dcafc2a4 0 0 1 1 2 14.11
10 0009655768c64bdeb2e877511632db8f 540 744 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 0 0 0.00
11 0009655768c64bdeb2e877511632db8f 552 552 0 0 0 1 0 13.56
12 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 10.27
13 0009655768c64bdeb2e877511632db8f 660 660 0 0 0 1 0 12.36
14 0009655768c64bdeb2e877511632db8f 690 690 0 0 0 1 0 28.16
15 0009655768c64bdeb2e877511632db8f 696 696 0 0 0 1 0 18.41
16 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00
17 00116118485d4dfda04fdbaba9a87b5c 216 288 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00
18 00116118485d4dfda04fdbaba9a87b5c 294 294 0 0 0 1 0 0.70
19 00116118485d4dfda04fdbaba9a87b5c 456 456 0 0 0 1 0 0.20
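The forward-fill step relies on the data being sorted by person and time, so that every viewed or completed row sits directly below its offer-received row and inherits that row's expiration. A small sketch of the idea:

```python
import pandas as pd

# An offer-received row carries the true expiration (240); the blanked rows
# below it inherit that value via forward fill.
s = pd.Series([240.0, None, 300.0, None])
filled = s.ffill()
```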

5. We will use the time column to create three new columns: offer_received_time, offer_viewed_time, and offer_completed_time.

Code

In [71]:
transcript_clean['offer_received_time'] = transcript_clean[transcript_clean.offer_received == 1]['time']

transcript_clean['offer_viewed_time'] = transcript_clean[transcript_clean.offer_viewed == 1]['time']

transcript_clean['offer_completed_time'] = transcript_clean[transcript_clean.offer_completed == 1]['time']

transcript_clean['offer_received_time'] = transcript_clean.offer_received_time.fillna(0)
transcript_clean['offer_viewed_time'] = transcript_clean.offer_viewed_time.fillna(0)
transcript_clean['offer_completed_time'] = transcript_clean.offer_completed_time.fillna(0)

Test

In [72]:
transcript_clean.head(20)
Out[72]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward amount offer_received_time offer_viewed_time offer_completed_time
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00 168.0 0.0 0.0
1 0009655768c64bdeb2e877511632db8f 192 240 5a8bc65990b245e5a138643cd4eb9837 0 1 0 0 0 0.00 0.0 192.0 0.0
2 0009655768c64bdeb2e877511632db8f 228 228 0 0 0 1 0 22.16 0.0 0.0 0.0
3 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00 336.0 0.0 0.0
4 0009655768c64bdeb2e877511632db8f 372 432 3f207df678b143eea3cee63160fa8bed 0 1 0 0 0 0.00 0.0 372.0 0.0
5 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 408.0 0.0 0.0
6 0009655768c64bdeb2e877511632db8f 414 528 f19421c1d4aa40978ebb69ca19b0e20d 0 0 1 1 5 8.57 0.0 0.0 414.0
7 0009655768c64bdeb2e877511632db8f 456 528 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00 0.0 456.0 0.0
8 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00 504.0 0.0 0.0
9 0009655768c64bdeb2e877511632db8f 528 744 fafdcd668e3743c1bb461111dcafc2a4 0 0 1 1 2 14.11 0.0 0.0 528.0
10 0009655768c64bdeb2e877511632db8f 540 744 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 0 0 0.00 0.0 540.0 0.0
11 0009655768c64bdeb2e877511632db8f 552 552 0 0 0 1 0 13.56 0.0 0.0 0.0
12 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 10.27 576.0 0.0 576.0
13 0009655768c64bdeb2e877511632db8f 660 660 0 0 0 1 0 12.36 0.0 0.0 0.0
14 0009655768c64bdeb2e877511632db8f 690 690 0 0 0 1 0 28.16 0.0 0.0 0.0
15 0009655768c64bdeb2e877511632db8f 696 696 0 0 0 1 0 18.41 0.0 0.0 0.0
16 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 168.0 0.0 0.0
17 00116118485d4dfda04fdbaba9a87b5c 216 288 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00 0.0 216.0 0.0
18 00116118485d4dfda04fdbaba9a87b5c 294 294 0 0 0 1 0 0.70 0.0 0.0 0.0
19 00116118485d4dfda04fdbaba9a87b5c 456 456 0 0 0 1 0 0.20 0.0 0.0 0.0
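Each event-time column is built by keeping time only on the rows where the matching flag is set; the other rows become NaN and are then filled with 0. On a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'time': [168, 192, 228],
                    'offer_received': [1, 0, 0],
                    'offer_viewed': [0, 1, 0]})
# Keep time only where the flag is 1; unmatched rows become NaN, then 0.
toy['offer_received_time'] = toy[toy.offer_received == 1]['time']
toy['offer_received_time'] = toy.offer_received_time.fillna(0)
```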

6. A person can receive the same offer multiple times. To consolidate the records associated with each offer instance within its validity window, we will create a new column, "offerid_expiration" (offer id concatenated with expiration time), and use it to group the transactions.

Code

In [73]:
transcript_clean['offerid_expiration'] = ''

idx = transcript_clean[transcript_clean.offer_id != ''].index

transcript_clean['expiration'] = transcript_clean.expiration.astype(str)

transcript_clean.loc[idx, 'offerid_expiration'] = transcript_clean.loc[idx, 'offer_id'] + transcript_clean.loc[idx, 'expiration']

transcript_clean['expiration'] = transcript_clean.expiration.astype(int)

Test

In [74]:
transcript_clean.head(20)
Out[74]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward amount offer_received_time offer_viewed_time offer_completed_time offerid_expiration
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00 168.0 0.0 0.0 5a8bc65990b245e5a138643cd4eb9837240
1 0009655768c64bdeb2e877511632db8f 192 240 5a8bc65990b245e5a138643cd4eb9837 0 1 0 0 0 0.00 0.0 192.0 0.0 5a8bc65990b245e5a138643cd4eb9837240
2 0009655768c64bdeb2e877511632db8f 228 228 0 0 0 1 0 22.16 0.0 0.0 0.0
3 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00 336.0 0.0 0.0 3f207df678b143eea3cee63160fa8bed432
4 0009655768c64bdeb2e877511632db8f 372 432 3f207df678b143eea3cee63160fa8bed 0 1 0 0 0 0.00 0.0 372.0 0.0 3f207df678b143eea3cee63160fa8bed432
5 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 408.0 0.0 0.0 f19421c1d4aa40978ebb69ca19b0e20d528
6 0009655768c64bdeb2e877511632db8f 414 528 f19421c1d4aa40978ebb69ca19b0e20d 0 0 1 1 5 8.57 0.0 0.0 414.0 f19421c1d4aa40978ebb69ca19b0e20d528
7 0009655768c64bdeb2e877511632db8f 456 528 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00 0.0 456.0 0.0 f19421c1d4aa40978ebb69ca19b0e20d528
8 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00 504.0 0.0 0.0 fafdcd668e3743c1bb461111dcafc2a4744
9 0009655768c64bdeb2e877511632db8f 528 744 fafdcd668e3743c1bb461111dcafc2a4 0 0 1 1 2 14.11 0.0 0.0 528.0 fafdcd668e3743c1bb461111dcafc2a4744
10 0009655768c64bdeb2e877511632db8f 540 744 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 0 0 0.00 0.0 540.0 0.0 fafdcd668e3743c1bb461111dcafc2a4744
11 0009655768c64bdeb2e877511632db8f 552 552 0 0 0 1 0 13.56 0.0 0.0 0.0
12 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 10.27 576.0 0.0 576.0 2906b810c7d4411798c6938adc9daaa5744
13 0009655768c64bdeb2e877511632db8f 660 660 0 0 0 1 0 12.36 0.0 0.0 0.0
14 0009655768c64bdeb2e877511632db8f 690 690 0 0 0 1 0 28.16 0.0 0.0 0.0
15 0009655768c64bdeb2e877511632db8f 696 696 0 0 0 1 0 18.41 0.0 0.0 0.0
16 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 168.0 0.0 0.0 f19421c1d4aa40978ebb69ca19b0e20d288
17 00116118485d4dfda04fdbaba9a87b5c 216 288 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00 0.0 216.0 0.0 f19421c1d4aa40978ebb69ca19b0e20d288
18 00116118485d4dfda04fdbaba9a87b5c 294 294 0 0 0 1 0 0.70 0.0 0.0 0.0
19 00116118485d4dfda04fdbaba9a87b5c 456 456 0 0 0 1 0 0.20 0.0 0.0 0.0
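Concatenating the offer id with the expiration time yields a key that distinguishes two deliveries of the same offer to the same person, since each delivery expires at a different time. A small illustration (with a shortened, hypothetical offer id):

```python
import pandas as pd

# The same offer received twice, expiring at different times, produces two
# distinct grouping keys.
df = pd.DataFrame({'offer_id': ['f194', 'f194'], 'expiration': [528, 696]})
df['offerid_expiration'] = df['offer_id'] + df['expiration'].astype(str)
```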

7. Consolidate the transaction records that fall within each offer's validity window.

Code

In [75]:
transcript_time = transcript_clean.groupby(['person','offerid_expiration'], as_index=False)[['amount',
                                                                                         'offer_id',
                                                                                         'offer_received_time',
                                                                                         'offer_viewed_time',
                                                                                         'offer_completed_time']].max()
In [76]:
transcript_clean.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time'],
                          inplace=True)
    
transcript_clean = transcript_clean.merge(transcript_time,
                                          on=['person', 'offerid_expiration'],
                                          how='outer')
In [77]:
transcript_clean.fillna(0,inplace=True)

transcript_clean.drop(columns=['offerid_expiration','offer_id_y'],inplace=True)

transcript_clean.rename(columns={'offer_id_x':'offer_id'},inplace=True)
In [78]:
transcript_clean = transcript_clean.sort_values(by=['person','time'])

Test

In [79]:
transcript_clean.head(20)
Out[79]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward amount_x amount_y offer_received_time offer_viewed_time offer_completed_time
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00 0.00 168.0 192.0 0.0
1 0009655768c64bdeb2e877511632db8f 192 240 5a8bc65990b245e5a138643cd4eb9837 0 1 0 0 0 0.00 0.00 168.0 192.0 0.0
2 0009655768c64bdeb2e877511632db8f 228 228 0 0 0 1 0 22.16 28.16 0.0 0.0 0.0
7 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00 0.00 336.0 372.0 0.0
8 0009655768c64bdeb2e877511632db8f 372 432 3f207df678b143eea3cee63160fa8bed 0 1 0 0 0 0.00 0.00 336.0 372.0 0.0
9 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 8.57 408.0 456.0 414.0
10 0009655768c64bdeb2e877511632db8f 414 528 f19421c1d4aa40978ebb69ca19b0e20d 0 0 1 1 5 8.57 8.57 408.0 456.0 414.0
11 0009655768c64bdeb2e877511632db8f 456 528 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00 8.57 408.0 456.0 414.0
12 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00 14.11 504.0 540.0 528.0
13 0009655768c64bdeb2e877511632db8f 528 744 fafdcd668e3743c1bb461111dcafc2a4 0 0 1 1 2 14.11 14.11 504.0 540.0 528.0
14 0009655768c64bdeb2e877511632db8f 540 744 fafdcd668e3743c1bb461111dcafc2a4 0 1 0 0 0 0.00 14.11 504.0 540.0 528.0
3 0009655768c64bdeb2e877511632db8f 552 552 0 0 0 1 0 13.56 28.16 0.0 0.0 0.0
15 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 10.27 10.27 576.0 0.0 576.0
4 0009655768c64bdeb2e877511632db8f 660 660 0 0 0 1 0 12.36 28.16 0.0 0.0 0.0
5 0009655768c64bdeb2e877511632db8f 690 690 0 0 0 1 0 28.16 28.16 0.0 0.0 0.0
6 0009655768c64bdeb2e877511632db8f 696 696 0 0 0 1 0 18.41 28.16 0.0 0.0 0.0
16 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 0.00 168.0 216.0 0.0
17 00116118485d4dfda04fdbaba9a87b5c 216 288 f19421c1d4aa40978ebb69ca19b0e20d 0 1 0 0 0 0.00 0.00 168.0 216.0 0.0
18 00116118485d4dfda04fdbaba9a87b5c 294 294 0 0 0 1 0 0.70 3.19 0.0 0.0 0.0
19 00116118485d4dfda04fdbaba9a87b5c 456 456 0 0 0 1 0 0.20 3.19 0.0 0.0 0.0
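The amount_x/amount_y pair in the output comes from pandas' merge behavior: when both frames carry a column with the same name that is not a join key, the result keeps both copies with _x and _y suffixes.

```python
import pandas as pd

# Both frames have an 'amount' column; after the merge it appears twice,
# suffixed _x (left) and _y (right).
left = pd.DataFrame({'key': ['a'], 'amount': [22.16]})
right = pd.DataFrame({'key': ['a'], 'amount': [28.16]})
merged = left.merge(right, on='key', how='outer')
cols = list(merged.columns)
```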

8. We still have separate records for offer-viewed and offer-completed events. We will remove these rows, as their information has already been captured in the corresponding offer-received row.

Code

In [80]:
idx = transcript_clean[(transcript_clean.offer_id != '') & (transcript_clean.offer_received == 0)].index

transcript_clean.drop(labels=idx,inplace=True)
transcript_clean.reset_index(inplace=True,drop=True)

Test

In [81]:
transcript_clean.head(20)
Out[81]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward amount_x amount_y offer_received_time offer_viewed_time offer_completed_time
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00 0.00 168.0 192.0 0.0
1 0009655768c64bdeb2e877511632db8f 228 228 0 0 0 1 0 22.16 28.16 0.0 0.0 0.0
2 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00 0.00 336.0 372.0 0.0
3 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 8.57 408.0 456.0 414.0
4 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 0.00 14.11 504.0 540.0 528.0
5 0009655768c64bdeb2e877511632db8f 552 552 0 0 0 1 0 13.56 28.16 0.0 0.0 0.0
6 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 10.27 10.27 576.0 0.0 576.0
7 0009655768c64bdeb2e877511632db8f 660 660 0 0 0 1 0 12.36 28.16 0.0 0.0 0.0
8 0009655768c64bdeb2e877511632db8f 690 690 0 0 0 1 0 28.16 28.16 0.0 0.0 0.0
9 0009655768c64bdeb2e877511632db8f 696 696 0 0 0 1 0 18.41 28.16 0.0 0.0 0.0
10 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 0.00 168.0 216.0 0.0
11 00116118485d4dfda04fdbaba9a87b5c 294 294 0 0 0 1 0 0.70 3.19 0.0 0.0 0.0
12 00116118485d4dfda04fdbaba9a87b5c 456 456 0 0 0 1 0 0.20 3.19 0.0 0.0 0.0
13 00116118485d4dfda04fdbaba9a87b5c 474 474 0 0 0 1 0 3.19 3.19 0.0 0.0 0.0
14 00116118485d4dfda04fdbaba9a87b5c 576 696 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 0.00 0.00 576.0 630.0 0.0
15 0011e0d4e6b944f998e987f904e8c1e5 0 96 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.00 0.00 0.0 6.0 0.0
16 0011e0d4e6b944f998e987f904e8c1e5 132 132 0 0 0 1 0 13.49 23.03 0.0 0.0 0.0
17 0011e0d4e6b944f998e987f904e8c1e5 168 336 2298d6c36e964ae4a3e7e9706d1fb8c2 1 0 0 0 0 0.00 11.93 168.0 186.0 252.0
18 0011e0d4e6b944f998e987f904e8c1e5 336 408 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 0.00 0.00 336.0 354.0 0.0
19 0011e0d4e6b944f998e987f904e8c1e5 408 648 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 0 0 0.00 0.00 408.0 432.0 0.0

9. When we consolidated the transactions, the "amount_y" column for purchases made without an offer was populated with the maximum amount the person spent. We need to correct this.

Code

In [82]:
transcript_clean['amount'] = transcript_clean[transcript_clean.offer_id == '']['amount_x']

transcript_clean['amount'] = transcript_clean.amount.fillna(value=transcript_clean.amount_y)

transcript_clean.drop(columns=['amount_x','amount_y'],inplace=True)

Test

In [83]:
transcript_clean.head(20)
Out[83]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward offer_received_time offer_viewed_time offer_completed_time amount
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 168.0 192.0 0.0 0.00
1 0009655768c64bdeb2e877511632db8f 228 228 0 0 0 1 0 0.0 0.0 0.0 22.16
2 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 336.0 372.0 0.0 0.00
3 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 408.0 456.0 414.0 8.57
4 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 504.0 540.0 528.0 14.11
5 0009655768c64bdeb2e877511632db8f 552 552 0 0 0 1 0 0.0 0.0 0.0 13.56
6 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 576.0 0.0 576.0 10.27
7 0009655768c64bdeb2e877511632db8f 660 660 0 0 0 1 0 0.0 0.0 0.0 12.36
8 0009655768c64bdeb2e877511632db8f 690 690 0 0 0 1 0 0.0 0.0 0.0 28.16
9 0009655768c64bdeb2e877511632db8f 696 696 0 0 0 1 0 0.0 0.0 0.0 18.41
10 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 168.0 216.0 0.0 0.00
11 00116118485d4dfda04fdbaba9a87b5c 294 294 0 0 0 1 0 0.0 0.0 0.0 0.70
12 00116118485d4dfda04fdbaba9a87b5c 456 456 0 0 0 1 0 0.0 0.0 0.0 0.20
13 00116118485d4dfda04fdbaba9a87b5c 474 474 0 0 0 1 0 0.0 0.0 0.0 3.19
14 00116118485d4dfda04fdbaba9a87b5c 576 696 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 576.0 630.0 0.0 0.00
15 0011e0d4e6b944f998e987f904e8c1e5 0 96 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.0 6.0 0.0 0.00
16 0011e0d4e6b944f998e987f904e8c1e5 132 132 0 0 0 1 0 0.0 0.0 0.0 13.49
17 0011e0d4e6b944f998e987f904e8c1e5 168 336 2298d6c36e964ae4a3e7e9706d1fb8c2 1 0 0 0 0 168.0 186.0 252.0 11.93
18 0011e0d4e6b944f998e987f904e8c1e5 336 408 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 336.0 354.0 0.0 0.00
19 0011e0d4e6b944f998e987f904e8c1e5 408 648 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 0 0 408.0 432.0 0.0 0.00

10. For regular transactions, the expiration column is still populated with the transaction time. We will set it to 0.

Code

In [84]:
idx = transcript_clean[transcript_clean.offer_id == ''].index

transcript_clean.loc[idx, 'expiration'] = 0

Test

In [85]:
transcript_clean.head(20)
Out[85]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward offer_received_time offer_viewed_time offer_completed_time amount
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 168.0 192.0 0.0 0.00
1 0009655768c64bdeb2e877511632db8f 228 0 0 0 0 1 0 0.0 0.0 0.0 22.16
2 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 336.0 372.0 0.0 0.00
3 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 408.0 456.0 414.0 8.57
4 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 504.0 540.0 528.0 14.11
5 0009655768c64bdeb2e877511632db8f 552 0 0 0 0 1 0 0.0 0.0 0.0 13.56
6 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 576.0 0.0 576.0 10.27
7 0009655768c64bdeb2e877511632db8f 660 0 0 0 0 1 0 0.0 0.0 0.0 12.36
8 0009655768c64bdeb2e877511632db8f 690 0 0 0 0 1 0 0.0 0.0 0.0 28.16
9 0009655768c64bdeb2e877511632db8f 696 0 0 0 0 1 0 0.0 0.0 0.0 18.41
10 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 168.0 216.0 0.0 0.00
11 00116118485d4dfda04fdbaba9a87b5c 294 0 0 0 0 1 0 0.0 0.0 0.0 0.70
12 00116118485d4dfda04fdbaba9a87b5c 456 0 0 0 0 1 0 0.0 0.0 0.0 0.20
13 00116118485d4dfda04fdbaba9a87b5c 474 0 0 0 0 1 0 0.0 0.0 0.0 3.19
14 00116118485d4dfda04fdbaba9a87b5c 576 696 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 576.0 630.0 0.0 0.00
15 0011e0d4e6b944f998e987f904e8c1e5 0 96 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.0 6.0 0.0 0.00
16 0011e0d4e6b944f998e987f904e8c1e5 132 0 0 0 0 1 0 0.0 0.0 0.0 13.49
17 0011e0d4e6b944f998e987f904e8c1e5 168 336 2298d6c36e964ae4a3e7e9706d1fb8c2 1 0 0 0 0 168.0 186.0 252.0 11.93
18 0011e0d4e6b944f998e987f904e8c1e5 336 408 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 336.0 354.0 0.0 0.00
19 0011e0d4e6b944f998e987f904e8c1e5 408 648 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 0 0 408.0 432.0 0.0 0.00

11. A user is deemed to be influenced by a promotion only if the individual completed a transaction after viewing the offer. We will create a new column that records whether the promotion influenced the individual.

Code

In [86]:
idx = transcript_clean[(transcript_clean.offer_viewed_time >0) 
                           & (transcript_clean.offer_viewed_time >  transcript_clean.offer_received_time)
                           & (transcript_clean.offer_completed_time > transcript_clean.offer_viewed_time)].index
    
transcript_clean['influenced'] = 0

transcript_clean.loc[idx, 'influenced'] = 1

Test

In [87]:
transcript_clean.head(20)
Out[87]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward offer_received_time offer_viewed_time offer_completed_time amount influenced
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 168.0 192.0 0.0 0.00 0
1 0009655768c64bdeb2e877511632db8f 228 0 0 0 0 1 0 0.0 0.0 0.0 22.16 0
2 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 336.0 372.0 0.0 0.00 0
3 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 408.0 456.0 414.0 8.57 0
4 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 504.0 540.0 528.0 14.11 0
5 0009655768c64bdeb2e877511632db8f 552 0 0 0 0 1 0 0.0 0.0 0.0 13.56 0
6 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 576.0 0.0 576.0 10.27 0
7 0009655768c64bdeb2e877511632db8f 660 0 0 0 0 1 0 0.0 0.0 0.0 12.36 0
8 0009655768c64bdeb2e877511632db8f 690 0 0 0 0 1 0 0.0 0.0 0.0 28.16 0
9 0009655768c64bdeb2e877511632db8f 696 0 0 0 0 1 0 0.0 0.0 0.0 18.41 0
10 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 168.0 216.0 0.0 0.00 0
11 00116118485d4dfda04fdbaba9a87b5c 294 0 0 0 0 1 0 0.0 0.0 0.0 0.70 0
12 00116118485d4dfda04fdbaba9a87b5c 456 0 0 0 0 1 0 0.0 0.0 0.0 0.20 0
13 00116118485d4dfda04fdbaba9a87b5c 474 0 0 0 0 1 0 0.0 0.0 0.0 3.19 0
14 00116118485d4dfda04fdbaba9a87b5c 576 696 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 576.0 630.0 0.0 0.00 0
15 0011e0d4e6b944f998e987f904e8c1e5 0 96 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0.0 6.0 0.0 0.00 0
16 0011e0d4e6b944f998e987f904e8c1e5 132 0 0 0 0 1 0 0.0 0.0 0.0 13.49 0
17 0011e0d4e6b944f998e987f904e8c1e5 168 336 2298d6c36e964ae4a3e7e9706d1fb8c2 1 0 0 0 0 168.0 186.0 252.0 11.93 1
18 0011e0d4e6b944f998e987f904e8c1e5 336 408 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 336.0 354.0 0.0 0.00 0
19 0011e0d4e6b944f998e987f904e8c1e5 408 648 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 0 0 408.0 432.0 0.0 0.00 0
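The influence condition above requires a strict ordering of events: received, then viewed, then completed. A toy frame with one influenced and one uninfluenced row (the second was completed without ever being viewed):

```python
import pandas as pd

# A row counts as influenced only when the view happened after receipt and
# the completion happened strictly after the view.
df = pd.DataFrame({'offer_received_time': [168, 576],
                   'offer_viewed_time': [186, 0],
                   'offer_completed_time': [252, 576]})
influenced = ((df.offer_viewed_time > 0)
              & (df.offer_viewed_time > df.offer_received_time)
              & (df.offer_completed_time > df.offer_viewed_time)).astype(int)
```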

12. Create a new column to capture the transaction time.

Code

In [88]:
transcript_clean['offer_received_time'] = transcript_clean.offer_received_time.astype(int)
transcript_clean['offer_viewed_time'] = transcript_clean.offer_viewed_time.astype(int)
transcript_clean['offer_completed_time'] = transcript_clean.offer_completed_time.astype(int)

transcript_clean['transaction_time'] = 0

idx = transcript_clean[transcript_clean.transaction == 1].index

transcript_clean.loc[idx, 'transaction_time'] = transcript_clean.loc[idx, 'time']

idx = transcript_clean[transcript_clean.transaction == 0].index

transcript_clean.loc[idx, 'transaction_time'] = transcript_clean.loc[idx, 'offer_completed_time']

Test

In [89]:
transcript_clean.head(20)
Out[89]:
person time expiration offer_id offer_received offer_viewed offer_completed transaction reward offer_received_time offer_viewed_time offer_completed_time amount influenced transaction_time
0 0009655768c64bdeb2e877511632db8f 168 240 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 168 192 0 0.00 0 0
1 0009655768c64bdeb2e877511632db8f 228 0 0 0 0 1 0 0 0 0 22.16 0 228
2 0009655768c64bdeb2e877511632db8f 336 432 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 336 372 0 0.00 0 0
3 0009655768c64bdeb2e877511632db8f 408 528 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 408 456 414 8.57 0 414
4 0009655768c64bdeb2e877511632db8f 504 744 fafdcd668e3743c1bb461111dcafc2a4 1 0 0 0 0 504 540 528 14.11 0 528
5 0009655768c64bdeb2e877511632db8f 552 0 0 0 0 1 0 0 0 0 13.56 0 552
6 0009655768c64bdeb2e877511632db8f 576 744 2906b810c7d4411798c6938adc9daaa5 1 0 1 1 2 576 0 576 10.27 0 576
7 0009655768c64bdeb2e877511632db8f 660 0 0 0 0 1 0 0 0 0 12.36 0 660
8 0009655768c64bdeb2e877511632db8f 690 0 0 0 0 1 0 0 0 0 28.16 0 690
9 0009655768c64bdeb2e877511632db8f 696 0 0 0 0 1 0 0 0 0 18.41 0 696
10 00116118485d4dfda04fdbaba9a87b5c 168 288 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 168 216 0 0.00 0 0
11 00116118485d4dfda04fdbaba9a87b5c 294 0 0 0 0 1 0 0 0 0 0.70 0 294
12 00116118485d4dfda04fdbaba9a87b5c 456 0 0 0 0 1 0 0 0 0 0.20 0 456
13 00116118485d4dfda04fdbaba9a87b5c 474 0 0 0 0 1 0 0 0 0 3.19 0 474
14 00116118485d4dfda04fdbaba9a87b5c 576 696 f19421c1d4aa40978ebb69ca19b0e20d 1 0 0 0 0 576 630 0 0.00 0 0
15 0011e0d4e6b944f998e987f904e8c1e5 0 96 3f207df678b143eea3cee63160fa8bed 1 0 0 0 0 0 6 0 0.00 0 0
16 0011e0d4e6b944f998e987f904e8c1e5 132 0 0 0 0 1 0 0 0 0 13.49 0 132
17 0011e0d4e6b944f998e987f904e8c1e5 168 336 2298d6c36e964ae4a3e7e9706d1fb8c2 1 0 0 0 0 168 186 252 11.93 1 252
18 0011e0d4e6b944f998e987f904e8c1e5 336 408 5a8bc65990b245e5a138643cd4eb9837 1 0 0 0 0 336 354 0 0.00 0 0
19 0011e0d4e6b944f998e987f904e8c1e5 408 648 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 0 0 408 432 0 0.00 0 0
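The two index assignments above amount to choosing time for transaction rows and offer_completed_time for everything else, which can equivalently be written with Series.where:

```python
import pandas as pd

# Keep 'time' where transaction == 1; fall back to 'offer_completed_time'.
df = pd.DataFrame({'time': [228, 408],
                   'transaction': [1, 0],
                   'offer_completed_time': [0, 414]})
df['transaction_time'] = df['time'].where(df.transaction == 1,
                                          df['offer_completed_time'])
```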

13. When the transactions were consolidated, the offer_received, offer_viewed, and offer_completed flags were lost. We need to repopulate them with the correct values.

Code

In [90]:
transcript_clean['offer_received'] = 0

idx = transcript_clean[transcript_clean.offer_received_time > 0].index

transcript_clean.loc[idx, 'offer_received'] = 1

transcript_clean['offer_viewed'] = 0

idx = transcript_clean[transcript_clean.offer_viewed_time > 0].index

transcript_clean.loc[idx, 'offer_viewed'] = 1


transcript_clean['offer_completed'] = 0

idx = transcript_clean[transcript_clean.offer_completed_time > 0].index

transcript_clean.loc[idx, 'offer_completed'] = 1

transcript_clean = transcript_clean[['person','offer_id', 'time','offer_received_time', 'offer_viewed_time', 
                                     'offer_completed_time','transaction_time','expiration','offer_received',
                                     'offer_viewed','offer_completed', 'transaction', 'reward','amount',
                                     'influenced']]

Test

In [91]:
transcript_clean.head(20)
Out[91]:
person offer_id time offer_received_time offer_viewed_time offer_completed_time transaction_time expiration offer_received offer_viewed offer_completed transaction reward amount influenced
0 0009655768c64bdeb2e877511632db8f 5a8bc65990b245e5a138643cd4eb9837 168 168 192 0 0 240 1 1 0 0 0 0.00 0
1 0009655768c64bdeb2e877511632db8f 228 0 0 0 228 0 0 0 0 1 0 22.16 0
2 0009655768c64bdeb2e877511632db8f 3f207df678b143eea3cee63160fa8bed 336 336 372 0 0 432 1 1 0 0 0 0.00 0
3 0009655768c64bdeb2e877511632db8f f19421c1d4aa40978ebb69ca19b0e20d 408 408 456 414 414 528 1 1 1 0 0 8.57 0
4 0009655768c64bdeb2e877511632db8f fafdcd668e3743c1bb461111dcafc2a4 504 504 540 528 528 744 1 1 1 0 0 14.11 0
5 0009655768c64bdeb2e877511632db8f 552 0 0 0 552 0 0 0 0 1 0 13.56 0
6 0009655768c64bdeb2e877511632db8f 2906b810c7d4411798c6938adc9daaa5 576 576 0 576 576 744 1 0 1 1 2 10.27 0
7 0009655768c64bdeb2e877511632db8f 660 0 0 0 660 0 0 0 0 1 0 12.36 0
8 0009655768c64bdeb2e877511632db8f 690 0 0 0 690 0 0 0 0 1 0 28.16 0
9 0009655768c64bdeb2e877511632db8f 696 0 0 0 696 0 0 0 0 1 0 18.41 0
10 00116118485d4dfda04fdbaba9a87b5c f19421c1d4aa40978ebb69ca19b0e20d 168 168 216 0 0 288 1 1 0 0 0 0.00 0
11 00116118485d4dfda04fdbaba9a87b5c 294 0 0 0 294 0 0 0 0 1 0 0.70 0
12 00116118485d4dfda04fdbaba9a87b5c 456 0 0 0 456 0 0 0 0 1 0 0.20 0
13 00116118485d4dfda04fdbaba9a87b5c 474 0 0 0 474 0 0 0 0 1 0 3.19 0
14 00116118485d4dfda04fdbaba9a87b5c f19421c1d4aa40978ebb69ca19b0e20d 576 576 630 0 0 696 1 1 0 0 0 0.00 0
15 0011e0d4e6b944f998e987f904e8c1e5 3f207df678b143eea3cee63160fa8bed 0 0 6 0 0 96 0 1 0 0 0 0.00 0
16 0011e0d4e6b944f998e987f904e8c1e5 132 0 0 0 132 0 0 0 0 1 0 13.49 0
17 0011e0d4e6b944f998e987f904e8c1e5 2298d6c36e964ae4a3e7e9706d1fb8c2 168 168 186 252 252 336 1 1 1 0 0 11.93 1
18 0011e0d4e6b944f998e987f904e8c1e5 5a8bc65990b245e5a138643cd4eb9837 336 336 354 0 0 408 1 1 0 0 0 0.00 0
19 0011e0d4e6b944f998e987f904e8c1e5 0b1e1539f2cc45b7b9fa7c272da2e1d7 408 408 432 0 0 648 1 1 0 0 0 0.00 0

14. We no longer need the "time" and "expiration" columns. We will drop them.

Code

In [92]:
transcript_clean.drop(columns=['time','expiration'],inplace=True)

Test

In [93]:
transcript_clean.head(20)
Out[93]:
person offer_id offer_received_time offer_viewed_time offer_completed_time transaction_time offer_received offer_viewed offer_completed transaction reward amount influenced
0 0009655768c64bdeb2e877511632db8f 5a8bc65990b245e5a138643cd4eb9837 168 192 0 0 1 1 0 0 0 0.00 0
1 0009655768c64bdeb2e877511632db8f 0 0 0 228 0 0 0 1 0 22.16 0
2 0009655768c64bdeb2e877511632db8f 3f207df678b143eea3cee63160fa8bed 336 372 0 0 1 1 0 0 0 0.00 0
3 0009655768c64bdeb2e877511632db8f f19421c1d4aa40978ebb69ca19b0e20d 408 456 414 414 1 1 1 0 0 8.57 0
4 0009655768c64bdeb2e877511632db8f fafdcd668e3743c1bb461111dcafc2a4 504 540 528 528 1 1 1 0 0 14.11 0
5 0009655768c64bdeb2e877511632db8f 0 0 0 552 0 0 0 1 0 13.56 0
6 0009655768c64bdeb2e877511632db8f 2906b810c7d4411798c6938adc9daaa5 576 0 576 576 1 0 1 1 2 10.27 0
7 0009655768c64bdeb2e877511632db8f 0 0 0 660 0 0 0 1 0 12.36 0
8 0009655768c64bdeb2e877511632db8f 0 0 0 690 0 0 0 1 0 28.16 0
9 0009655768c64bdeb2e877511632db8f 0 0 0 696 0 0 0 1 0 18.41 0
10 00116118485d4dfda04fdbaba9a87b5c f19421c1d4aa40978ebb69ca19b0e20d 168 216 0 0 1 1 0 0 0 0.00 0
11 00116118485d4dfda04fdbaba9a87b5c 0 0 0 294 0 0 0 1 0 0.70 0
12 00116118485d4dfda04fdbaba9a87b5c 0 0 0 456 0 0 0 1 0 0.20 0
13 00116118485d4dfda04fdbaba9a87b5c 0 0 0 474 0 0 0 1 0 3.19 0
14 00116118485d4dfda04fdbaba9a87b5c f19421c1d4aa40978ebb69ca19b0e20d 576 630 0 0 1 1 0 0 0 0.00 0
15 0011e0d4e6b944f998e987f904e8c1e5 3f207df678b143eea3cee63160fa8bed 0 6 0 0 0 1 0 0 0 0.00 0
16 0011e0d4e6b944f998e987f904e8c1e5 0 0 0 132 0 0 0 1 0 13.49 0
17 0011e0d4e6b944f998e987f904e8c1e5 2298d6c36e964ae4a3e7e9706d1fb8c2 168 186 252 252 1 1 1 0 0 11.93 1
18 0011e0d4e6b944f998e987f904e8c1e5 5a8bc65990b245e5a138643cd4eb9837 336 354 0 0 1 1 0 0 0 0.00 0
19 0011e0d4e6b944f998e987f904e8c1e5 0b1e1539f2cc45b7b9fa7c272da2e1d7 408 432 0 0 1 1 0 0 0 0.00 0

We will consolidate all the cleaning steps into a single function.

In [94]:
def clean_transcript(transcript_clone=None):
    
    if transcript_clone is None:
        # Avoid evaluating transcript.copy() as a default argument (it would run once, at definition time).
        transcript_clone = transcript.copy()
    
    '''
        Create dummy columns out of event column
    '''
    transcript_clone =pd.get_dummies(transcript_clone,columns=['event'])
    
    transcript_clone.rename(columns={'event_offer completed':'offer_completed',
                                     'event_offer received':'offer_received',
                                     'event_offer viewed':'offer_viewed',
                                     'event_transaction':'transaction'},
                            inplace=True)
    
    '''
        "value" column is a composite column that contains Offer ID, Reward and Amount information. We will extract 
        information into individual columns
    '''
    transcript_clone['offer_id'] = transcript_clone.value.apply(get_offer_id)
    transcript_clone['reward'] = transcript_clone.value.apply(get_reward)
    transcript_clone['amount'] = transcript_clone.value.apply(get_amount)
    
    transcript_clone.drop(columns=['value'],inplace=True)
    
    transcript_clone = transcript_clone[['person','time','offer_id','offer_received','offer_viewed','offer_completed',
                                         'transaction','reward','amount']]
    
    '''
        When an individual has utilized an offer, two transaction records are created: one for claiming the 
        reward and one for making the purchase. We will consolidate these two records into one.
    '''
    transcript_clean = transcript_clone.groupby(['person','time'],as_index=False).agg('max')
    
    '''
         Each offer is valid only for a certain number of days once received. In the current data frame, we do not 
         have this information. For successful completion of the offer, the offer should be utilized before expiration.
    '''
    transcript_clean['duration'] = transcript_clean[transcript_clean.offer_received == 1].offer_id.apply(get_duration)
    
    transcript_clean['duration'] = transcript_clean.duration.fillna(0)
    
    transcript_clean['duration'] = transcript_clean.duration.apply(lambda x:x*24)
    
    transcript_clean['expiration'] = transcript_clean.time + transcript_clean.duration
    
    transcript_clean.drop(columns='duration',inplace=True)
    
    transcript_clean = transcript_clean[['person', 'time', 'expiration','offer_id', 'offer_received', 'offer_viewed',
                                         'offer_completed', 'transaction', 'reward', 'amount']]
    
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    
    '''
        The expiration column for rows that are not offer-received events is currently populated with the 
        transaction timestamp. We need to fill in the correct offer expiration time wherever an offer id exists.
    '''
    idx = transcript_clean[transcript_clean.offer_received == 0].index
    
    transcript_clean.loc[idx, 'expiration'] = np.nan
    
    transcript_clean.expiration = transcript_clean.expiration.fillna(value=transcript_clean.time)
    
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    
    idx = transcript_clean[(transcript_clean.offer_id != '') 
                           & (transcript_clean.offer_received == 0)].index

    transcript_clean.loc[idx, 'expiration'] = np.nan
    
    transcript_clean.expiration = transcript_clean.expiration.ffill()
    
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    
    '''
        We will use time columns information to create new columns: offer_received_time, offer_viewed_time, 
        offer_completed_time
    '''
    transcript_clean['offer_received_time']=transcript_clean[transcript_clean.offer_received == 1]['time']
    
    transcript_clean['offer_viewed_time']=transcript_clean[transcript_clean.offer_viewed == 1]['time']
    
    transcript_clean['offer_completed_time']=transcript_clean[transcript_clean.offer_completed == 1]['time']
    
    transcript_clean['offer_received_time'] = transcript_clean.offer_received_time.fillna(0)
    transcript_clean['offer_viewed_time'] = transcript_clean.offer_viewed_time.fillna(0)
    transcript_clean['offer_completed_time'] = transcript_clean.offer_completed_time.fillna(0)
    
    
    '''
         A person can receive the same offer multiple times. To consolidate the transaction records that fall 
         within an offer's validity window, we will create a new column "offerid_expiration" and use it to 
         group the transactions.
    '''
    transcript_clean['offerid_expiration'] = ''
    
    idx = transcript_clean[transcript_clean.offer_id != ''].index
    
    transcript_clean['expiration'] = transcript_clean.expiration.astype(str)
    
    transcript_clean.loc[idx, 'offerid_expiration'] = transcript_clean.loc[idx, 'offer_id'] + transcript_clean.loc[idx, 'expiration']
    
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    
    '''
        Consolidate the transaction records that fall within each offer's validity window
    '''
    transcript_time = transcript_clean.groupby(['person','offerid_expiration'], as_index=False)[['amount',
                                                                                             'offer_id',
                                                                                             'offer_received_time',
                                                                                             'offer_viewed_time',
                                                                                             'offer_completed_time']].max()
    
    transcript_clean.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time'],
                          inplace=True)
    
    transcript_clean =transcript_clean.merge(transcript_time,
                                             left_on=['person','offerid_expiration'],
                                             right_on = ['person','offerid_expiration'],
                                             how= 'outer')
    
    transcript_clean.fillna(0,inplace=True)
    
    transcript_clean = transcript_clean.sort_values(by=['person','time'])
    
    transcript_clean.drop(columns=['offerid_expiration','offer_id_y'],inplace=True)

    transcript_clean.rename(columns={'offer_id_x':'offer_id'},inplace=True)
    
    '''
        We still have separate records for offer viewed / offer completed events. We will remove these rows, 
        as this information has already been captured in the offer received record.
    '''
    idx = transcript_clean[(transcript_clean.offer_id != '') & (transcript_clean.offer_received == 0)].index
    
    transcript_clean.drop(labels=idx,inplace=True)
    transcript_clean.reset_index(inplace=True,drop=True)
    
    '''
        When we consolidated the transactions, for purchases made without an offer, the "amount_y" column was 
        populated with the maximum amount spent by the person. We need to correct this.
    '''
    transcript_clean['amount']= transcript_clean[transcript_clean.offer_id == '']['amount_x']
    
    transcript_clean['amount']= transcript_clean.amount.fillna(value=transcript_clean.amount_y)
    
    transcript_clean.drop(columns=['amount_x','amount_y'],inplace=True)
    
    '''
         For regular transactions, we still have the expiration column populated. We will fill the expiration with 0.
    '''
    idx = transcript_clean[transcript_clean.offer_id == ''].index
    
    transcript_clean.loc[idx, 'expiration'] = 0
    
    '''
        A user is deemed influenced by a promotion only if the individual completed the offer after viewing it. 
        We will create a new column indicating whether or not the promotion influenced the individual.
    '''
    idx = transcript_clean[(transcript_clean.offer_viewed_time >0) 
                           & (transcript_clean.offer_viewed_time >  transcript_clean.offer_received_time)
                           & (transcript_clean.offer_completed_time > transcript_clean.offer_viewed_time)].index
    
    transcript_clean['influenced'] = 0
    
    transcript_clean.loc[idx, 'influenced'] = 1
    
    '''
        Create a new column to capture transaction time.
    '''
    transcript_clean['offer_received_time'] = transcript_clean.offer_received_time.astype(int)
    transcript_clean['offer_viewed_time'] = transcript_clean.offer_viewed_time.astype(int)
    transcript_clean['offer_completed_time'] = transcript_clean.offer_completed_time.astype(int)
    
    transcript_clean['transaction_time'] = 0
    
    idx = transcript_clean[transcript_clean.transaction == 1].index
    
    transcript_clean.loc[idx, 'transaction_time'] = transcript_clean.loc[idx, 'time']
    
    idx = transcript_clean[transcript_clean.transaction == 0].index
    
    transcript_clean.loc[idx, 'transaction_time'] = transcript_clean.loc[idx, 'offer_completed_time']
    
    '''
        When the transactions were consolidated, we lost the information in the offer_received, offer_viewed, 
        and offer_completed columns. We need to repopulate them with the correct values.
    '''
    
    transcript_clean['offer_received'] = 0

    idx = transcript_clean[transcript_clean.offer_received_time > 0].index

    transcript_clean.loc[idx, 'offer_received'] = 1
    
    transcript_clean['offer_viewed'] = 0

    idx = transcript_clean[transcript_clean.offer_viewed_time > 0].index

    transcript_clean.loc[idx, 'offer_viewed'] = 1
    
    
    transcript_clean['offer_completed'] = 0

    idx = transcript_clean[transcript_clean.offer_completed_time > 0].index

    transcript_clean.loc[idx, 'offer_completed'] = 1
    
    transcript_clean = transcript_clean[['person','offer_id', 'time','offer_received_time', 'offer_viewed_time', 
                                         'offer_completed_time','transaction_time','expiration','offer_received',
                                         'offer_viewed','offer_completed', 'transaction', 'reward','amount',
                                         'influenced']]
    
    '''
        We no longer need "time" and "expiration" information. We will drop these columns.
    '''
    transcript_clean.drop(columns=['time','expiration'],inplace=True)
    
    del transcript_clone
    del transcript_time
    gc.collect()
    
    return transcript_clean
In [95]:
#transcript_clean_1 = clean_transcript()
#transcript_clean.equals(transcript_clean_1)
In [96]:
del transcript_clone
del transcript_time
#del transcript_clean_1
gc.collect()
Out[96]:
7
In [97]:
transcript_clean.to_csv('data/transcript_clean.csv',index=False)

Now that we have all three data frames cleaned, let's consolidate them into one data frame.

In [98]:
transaction = transcript_clean.groupby(['person','offer_id'],as_index=False).sum()
In [99]:
transaction.influenced.replace(to_replace=2, value=1,inplace=True)
In [100]:
transaction.influenced.replace(to_replace=3, value=1,inplace=True)
In [101]:
transaction.influenced.value_counts()
Out[101]:
0    70165
1     9023
Name: influenced, dtype: int64
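The two `replace` calls above collapse any positive influenced count down to 1. The same result can be obtained in one vectorized step; a minimal sketch on a toy frame (the real notebook operates on `transaction`):

```python
import pandas as pd

# Toy frame standing in for the real `transaction` data; the column name matches the notebook's.
df = pd.DataFrame({'influenced': [0, 1, 2, 3, 0, 1]})

# Equivalent to chaining replace(2 -> 1) and replace(3 -> 1):
# any positive count of influenced events becomes the binary flag 1.
df['influenced'] = (df['influenced'] > 0).astype(int)

print(df['influenced'].tolist())  # [0, 1, 1, 1, 0, 1]
```

This also covers any counts above 3, should the groupby sum ever produce them.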
In [102]:
transaction.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time',
                          'transaction_time'], inplace = True)
In [103]:
transaction = transaction.merge(profile_for_ml,left_on='person',right_on='id')
In [104]:
transaction.drop(columns=['person','id'],inplace=True)
In [105]:
transaction = transaction.merge(portfolio_for_ml,left_on=['offer_id'],right_on=['id'], how= 'left')
In [106]:
transaction.drop(columns=['offer_received','offer_viewed','offer_completed','transaction',
                          'offer_id','id','reward_y'],inplace=True)
In [107]:
transaction.fillna(0,inplace=True)
In [108]:
transaction[['difficulty','duration', 'bogo', 'discount', 'informational', 'email', 'mobile',
             'social', 'web', 'offer_code']] = transaction[['difficulty','duration', 'bogo', 'discount', 
                                                            'informational', 'email', 'mobile','social', 'web', 
                                                            'offer_code']].astype(int)
In [109]:
transaction.rename(columns={'reward_x':'reward'},inplace=True)
In [110]:
transaction = pd.get_dummies(transaction, columns=['offer_code'])
In [111]:
transaction = transaction[['age', 'income', 'gender_F','gender_M', 'gender_O', 'became_member_on_year',
                          'became_member_on_month','became_member_on_date','difficulty','duration',
                          'bogo', 'discount', 'informational', 'email', 'mobile','social', 'web','reward', 'amount',
                          'influenced','offer_code_0', 'offer_code_1', 'offer_code_2','offer_code_3', 'offer_code_4',
                           'offer_code_5', 'offer_code_6','offer_code_7', 'offer_code_8', 'offer_code_9',
                           'offer_code_10']]

We will consolidate all of these steps into a single function.

In [112]:
def generate_transaction_without_dummies(transcript_clean,profile_for_ml,portfolio_for_ml):
    transaction = transcript_clean.groupby(['person','offer_id'],as_index=False).sum()
    
    transaction.influenced.replace(to_replace=2, value=1,inplace=True)
    transaction.influenced.replace(to_replace=3, value=1,inplace=True)
    
    transaction.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time',
                              'transaction_time'],
                     inplace = True)
    
    transaction = transaction.merge(profile_for_ml,left_on='person',right_on='id')
    
    transaction.drop(columns=['person','id'],inplace=True)
    
    transaction = transaction.merge(portfolio_for_ml,left_on=['offer_id'],right_on=['id'], how= 'left')
    
    transaction.drop(columns=['offer_received','offer_viewed','offer_completed','transaction','offer_id','id',
                              'reward_y'],
                     inplace=True)
    
    transaction.fillna(0,inplace=True)
    
    transaction[['difficulty','duration', 'bogo', 'discount', 'informational', 'email', 'mobile','social', 'web',
                 'offer_code']] = transaction[['difficulty','duration', 'bogo', 'discount', 'informational', 'email',
                                               'mobile','social', 'web', 'offer_code']].astype(int)
    
    transaction.rename(columns={'reward_x':'reward'},inplace=True)
    
    return transaction
In [113]:
def generate_transaction(transcript_clean,profile_for_ml,portfolio_for_ml):
    transaction = generate_transaction_without_dummies(transcript_clean,profile_for_ml,portfolio_for_ml)
    
    transaction = pd.get_dummies(transaction, columns=['offer_code'])
    
    transaction = transaction[['age', 'income', 'gender_F','gender_M', 'gender_O', 'became_member_on_year',
                          'became_member_on_month','became_member_on_date','difficulty','duration',
                          'bogo', 'discount', 'informational', 'email', 'mobile','social', 'web','reward', 'amount',
                          'influenced','offer_code_0', 'offer_code_1', 'offer_code_2','offer_code_3', 'offer_code_4',
                           'offer_code_5', 'offer_code_6','offer_code_7', 'offer_code_8', 'offer_code_9',
                           'offer_code_10']]
    
    return transaction
In [114]:
transaction_1 = generate_transaction(transcript_clean,profile_for_ml,portfolio_for_ml)
transaction.equals(transaction_1)
Out[114]:
True
In [115]:
del transaction_1
gc.collect()
Out[115]:
105

5.2. Data Analysis

Data analysis provides critical insights into the data and answers pertinent business questions. In this section, we will examine multivariate frequency distributions of:

* events by gender
* events by age
* events by income
* events by offer type
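The per-group tallies computed below all follow the same pandas pattern: group the per-person frame on one attribute and sum the event-count columns. A minimal sketch on made-up records (the values are illustrative, not drawn from the data set):

```python
import pandas as pd

# Illustrative per-person event counts; the real analysis aggregates the cleaned transcript
# merged with the profile data.
people = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'O', 'M'],
    'offer_received': [3, 2, 1, 4, 2],
    'offer_completed': [2, 0, 1, 1, 2],
})

# Sum the event counts within each gender group.
event_by_gender = people.groupby('gender', as_index=False)[['offer_received', 'offer_completed']].sum()
print(event_by_gender)
```

The same call with `'age'`, `'income'`, or `'offer_type'` as the grouping key produces the other distributions.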
In [116]:
transcript_by_person = transcript_clean.groupby('person',as_index=False).sum()
In [117]:
transcript_by_person.drop(columns=['offer_received_time', 'offer_viewed_time','offer_completed_time',
                                   'transaction_time'],
                          inplace = True)
In [118]:
transcript_by_person = profile.merge(transcript_by_person,left_on='id',right_on='person')
In [119]:
transcript_by_person.drop(columns=['id', 'person'],
                          inplace = True)
In [120]:
event_by_gender = transcript_by_person.groupby('gender',as_index=False)[['offer_completed','offer_received',
                                                                        'offer_viewed','transaction',
                                                                        'influenced']].sum()
In [121]:
event_by_gender.gender.replace(to_replace='F',value='Female',inplace=True)
event_by_gender.gender.replace(to_replace='M',value='Male',inplace=True)
event_by_gender.gender.replace(to_replace='O',value='Other',inplace=True)
In [122]:
trace0 = go.Bar(x=event_by_gender.gender,
                y=event_by_gender.offer_received,
                text=event_by_gender.offer_received,
                textposition = 'auto',
                name = 'Offer Received',
                marker=dict(color='rgba(0,107,164,1)',))

trace1 = go.Bar(x=event_by_gender.gender,
                y=event_by_gender.offer_viewed,
                text=event_by_gender.offer_viewed,
                textposition = 'auto',
                name = 'Offer Viewed',
                marker = dict(color ='rgba(255,128,14,1)',))

trace2 = go.Bar(x=event_by_gender.gender,
                y=event_by_gender.offer_completed,
                text=event_by_gender.offer_completed,
                textposition = 'auto',
                name = 'Offer Completed',
                marker = dict(color ='rgba(171,171,171,1)',))

trace3 = go.Bar(x=event_by_gender.gender,
                y=event_by_gender.transaction,
                text=event_by_gender.transaction,
                textposition = 'auto',
                name = 'Transaction',
                marker = dict(color ='rgba(89,89,89,1)',))

trace4 = go.Bar(x=event_by_gender.gender,
                y=event_by_gender.influenced,
                text=event_by_gender.influenced,
                textposition = 'auto',
                name = 'Influenced',
                marker = dict(color ='rgba(95,158,209,1)',))

layout = go.Layout(
    title = 'Event Distribution by Gender',
    xaxis=dict(tickangle=-45),
    barmode='group',
    bargap=0.15
)

fig = go.Figure(data=go.Data([trace0,trace1,trace2,trace3,trace4]),
                layout=layout)

iplot(fig)
In [123]:
del event_by_gender
gc.collect()
Out[123]:
7
In [124]:
event_by_age = transcript_by_person.groupby('age',as_index=False)[['offer_completed','offer_received',
                                                                        'offer_viewed','transaction',
                                                                        'influenced']].sum()
In [125]:
trace0 = go.Scatter(x=transcript_by_person.age,
                    y=transcript_by_person.offer_received,
                    name = 'Offer Received',
                    mode = 'markers',
                    marker=dict(color='rgba(0,107,164,1)',))

trace1 = go.Scatter(x=transcript_by_person.age,
                    y=transcript_by_person.offer_viewed,
                    name = 'Offer Viewed',
                    mode = 'markers',
                    marker = dict(color ='rgba(255,128,14,1)',))

trace2 = go.Scatter(x=transcript_by_person.age,
                    y=transcript_by_person.offer_completed,
                    name = 'Offer Completed',
                    mode = 'markers',
                    marker = dict(color ='rgba(171,171,171,1)',))

trace3 = go.Scatter(x=transcript_by_person.age,
                    y=transcript_by_person.transaction,
                    name = 'Transaction',
                    mode = 'markers',
                    marker = dict(color ='rgba(89,89,89,1)',))

trace4 = go.Scatter(x=transcript_by_person.age,
                    y=transcript_by_person.influenced,
                    name = 'Influenced',
                    mode = 'markers',
                    marker = dict(color ='rgba(95,158,209,1)',))

layout = go.Layout(
    title = 'Event Distribution by Age',
    bargap=0.15
)

fig = go.Figure(data=go.Data([trace0,trace1,trace2,trace3,trace4]),
                layout=layout)

iplot(fig)
In [126]:
del event_by_age

gc.collect()
Out[126]:
12613
In [127]:
event_by_income = transcript_by_person.groupby('income',as_index=False)[['offer_completed','offer_received',
                                                                        'offer_viewed','transaction',
                                                                        'influenced']].sum()
In [128]:
trace0 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.offer_received,
                    name = 'Offer Received',
                    marker=dict(color='rgba(0,107,164,1.25)',))

trace1 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.offer_viewed,
                    name = 'Offer Viewed',
                    marker = dict(color ='rgba(255,128,14,1.25)',))

trace2 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.offer_completed,
                    name = 'Offer Completed',
                    marker = dict(color ='rgba(200,82,0,1.25)',))

trace3 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.transaction,
                    name = 'Transaction',
                    marker = dict(color ='rgba(89,89,89,1.25)',))

trace4 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.influenced,
                    name = 'Influenced',
                    marker = dict(color ='rgba(95,158,209,1.25)',))

trace = go.Histogram(x=profile.income.values,
                     name='Income',
                     marker=dict(color='rgba(95,158,209,0.15)',),
                     yaxis='y2')

layout = go.Layout(
    title = 'Event Distribution by Income',
    yaxis2=dict(
        overlaying='y',
        side='right'
    )
)

fig = go.Figure(data=go.Data([trace0,trace1,trace2,trace3,trace4,trace]),
                layout=layout
               )

iplot(fig)
In [129]:
del event_by_income
del transcript_by_person

gc.collect()
Out[129]:
13077
In [130]:
transcript_by_offer = transcript_clean.groupby('offer_id',as_index=False).sum()
In [131]:
transcript_by_offer = portfolio.merge(transcript_by_offer,left_on='id',right_on='offer_id')
In [132]:
transcript_by_offer.drop(columns=['channels', 'difficulty', 'duration','reward_x','offer_id', 'offer_received_time',
                                  'offer_viewed_time','offer_completed_time', 'transaction_time','transaction',
                                  'reward_y', 'amount'],
                         inplace=True)
In [133]:
transcript_by_offer = transcript_by_offer.groupby('offer_type',as_index=False).sum()
In [134]:
trace0 = go.Bar(x=transcript_by_offer.offer_type,
                y=transcript_by_offer.offer_received,
                text=transcript_by_offer.offer_received,
                textposition = 'auto',
                name = 'Offer Received',
                marker=dict(color='rgba(0,107,164,1)',))

trace1 = go.Bar(x=transcript_by_offer.offer_type,
                y=transcript_by_offer.offer_viewed,
                text=transcript_by_offer.offer_viewed,
                textposition = 'auto',
                name = 'Offer Viewed',
                marker = dict(color ='rgba(255,128,14,1)',))

trace2 = go.Bar(x=transcript_by_offer.offer_type,
                y=transcript_by_offer.offer_completed,
                text=transcript_by_offer.offer_completed,
                textposition = 'auto',
                name = 'Offer Completed',
                marker = dict(color ='rgba(171,171,171,1)',))

trace3 = go.Bar(x=transcript_by_offer.offer_type,
                y=transcript_by_offer.influenced,
                text=transcript_by_offer.influenced,
                textposition = 'auto',
                name = 'Influenced',
                marker = dict(color ='rgba(89,89,89,1)',))

layout = go.Layout(
    title = 'Event Distribution by Offer Type',
    xaxis=dict(tickangle=-45),
    barmode='group',
    bargap=0.15
)

fig = go.Figure(data=go.Data([trace0,trace1,trace2,trace3]),
                layout=layout)

iplot(fig)
In [135]:
del transcript_by_offer
gc.collect()
Out[135]:
13444

6. Modeling

Modeling techniques are selected and applied. Since some methods, such as neural networks, have specific requirements regarding the form of the data, there can be a loop back to the data preparation phase here. Modeling is not a mandatory step and depends solely on the scope of the project. In this workbook, I am going to build three machine learning models:

  • one that predicts whether or not someone will respond to an offer.
  • one that predicts purchasing habits.
  • one that predicts the best offer for an individual.

All three models are built with ensemble methods.

For classification models, due to the imbalance in the target classes, we are going to use Precision, Recall, and F1 score values as the evaluation scores.

For regression models, we are going to use mean squared error and R2 values as the evaluation scores.
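As a quick illustration of these evaluation scores, the sketch below computes them with scikit-learn on small hand-made arrays (the labels and values are illustrative, not from the data set):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             mean_squared_error, r2_score)

# Imbalanced toy labels: 8 negatives, 2 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])  # one false positive, both positives caught

precision = precision_score(y_true, y_pred)  # 2 correct of 3 predicted positives -> 2/3
recall = recall_score(y_true, y_pred)        # 2 of 2 actual positives found -> 1.0
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

# Regression metrics on toy values.
y_reg_true = np.array([1.0, 2.0, 3.0])
y_reg_pred = np.array([1.1, 1.9, 3.2])
mse = mean_squared_error(y_reg_true, y_reg_pred)
r2 = r2_score(y_reg_true, y_reg_pred)

print(round(precision, 3), round(recall, 3), round(f1, 3))
print(round(mse, 3), round(r2, 3))
```

Note how accuracy would look deceptively good here (a constant "0" predictor scores 80%), which is exactly why precision, recall, and F1 are preferred for the imbalanced `influenced` target.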

In [136]:
features = transaction.columns.drop(['influenced'])
In [137]:
X = transaction[features]
y = transaction['influenced']
In [138]:
numerical_columns = ['age', 'income','became_member_on_year', 'became_member_on_month','became_member_on_date',
                     'difficulty','duration','reward','amount']
In [139]:
features = transaction.columns.drop(['age', 'income', 'gender_F', 'gender_M', 'gender_O','became_member_on_year', 
                                     'became_member_on_month','became_member_on_date','duration', 'bogo', 'discount',
                                     'informational', 'email', 'mobile', 'social', 'web','influenced','offer_code_0',
                                     'offer_code_1', 'offer_code_2','offer_code_3', 'offer_code_4', 'offer_code_5',
                                     'offer_code_6','offer_code_7', 'offer_code_8', 'offer_code_9', 'offer_code_10'])
In [140]:
X = transaction[features]
y = transaction['influenced']
In [141]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [142]:
transformer = make_union(StandardScaler())
In [143]:
clf = RandomForestClassifier()

pipeline = Pipeline([
    ('transformer',transformer),
    ('classifier',clf)
])

parameters = [
    {
        "classifier__n_estimators": range(10,110,10)
    },
    {
        "classifier": [AdaBoostClassifier()],
        "classifier__n_estimators": range(10,110,10),
        "classifier__learning_rate":np.linspace(0.1,2.5,20)
    },
    {
        "classifier": [ExtraTreesClassifier()],
        "classifier__n_estimators": range(10,110,10)
    },
    {
        "classifier": [GradientBoostingClassifier()],
        "classifier__n_estimators": range(10,110,10),
        "classifier__learning_rate":np.linspace(0.1,2.5,20)
    }
]
In [144]:
clf = AdaBoostClassifier()

pipeline = Pipeline([
    ('classifier',clf)
])

parameters = [
    {
        "classifier__n_estimators": [10],
        "classifier__learning_rate":[1.8684210526315792]
    }
]
In [145]:
scoring = make_scorer(f1_score)

# Set n_jobs to -1 to use all available CPU cores.
gridSearch = GridSearchCV(pipeline,
                          parameters,
                          verbose=2,
                          n_jobs = 6,
#                          n_jobs = -1,
                          cv = 5,
                          scoring=scoring,
                          return_train_score=True)
In [146]:
%%time
influnce_clf = gridSearch.fit(X_train,y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10 
[CV] classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10 
[CV] classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10 
[CV] classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10 
[CV] classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10 
[CV]  classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10, total=   0.2s
[CV]  classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10, total=   0.2s
[CV]  classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10, total=   0.2s
[CV]  classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10, total=   0.2s
[CV]  classifier__learning_rate=1.8684210526315792, classifier__n_estimators=10, total=   0.2s
[Parallel(n_jobs=6)]: Done   3 out of   5 | elapsed:    0.3s remaining:    0.2s
[Parallel(n_jobs=6)]: Done   5 out of   5 | elapsed:    0.4s finished
CPU times: user 268 ms, sys: 93.3 ms, total: 361 ms
Wall time: 738 ms
In [147]:
y_pred = influnce_clf.predict(X_test)
In [148]:
y_train_pred = influnce_clf.predict(X_train)
In [149]:
print(classification_report(y_true=y_train,y_pred= y_train_pred))
             precision    recall  f1-score   support

          0       0.99      0.83      0.91     48152
          1       0.46      0.97      0.63      7103

avg / total       0.93      0.85      0.87     55255

In [150]:
print(classification_report(y_true=y_test,y_pred= y_pred))
             precision    recall  f1-score   support

          0       0.99      0.83      0.91     12021
          1       0.46      0.97      0.63      1793

avg / total       0.93      0.85      0.87     13814

In [151]:
accuracy_score(y_true=y_train,y_pred= y_train_pred), accuracy_score(y_true=y_test,y_pred= y_pred)
Out[151]:
(0.8515428468011944, 0.8501520196901694)
In [152]:
f1_score(y_true=y_train,y_pred= y_train_pred), f1_score(y_true=y_test,y_pred= y_pred)
Out[152]:
(0.626814066693963, 0.6262188515709642)
In [153]:
influnce_clf.best_estimator_
Out[153]:
Pipeline(memory=None,
     steps=[('classifier', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.8684210526315792, n_estimators=10,
          random_state=None))])
In [154]:
influnce_clf_model = influnce_clf.best_estimator_.get_params('classifier')['classifier']
In [155]:
influnce_clf_model.feature_importances_
Out[155]:
array([0.1, 0.2, 0.7])
In [156]:
df = pd.DataFrame(list(zip(X.columns,influnce_clf_model.feature_importances_)),columns=['Feature','Importance'])
In [157]:
trace0 = go.Bar(x=df.Feature,
                y=df.Importance,
                text=df.Importance,
                textposition = 'auto',
                name = 'Offer Received',
                marker=dict(color='rgba(0,107,164,1)',))


layout = go.Layout(
    title = 'Feature Importance for Model to predict Influence',
    xaxis=dict(tickangle=-45),
    barmode='group',
    bargap=0.15
)

fig = go.Figure(data=go.Data([trace0]),
                layout=layout)

iplot(fig)
In [158]:
##Uncomment if needed
#dump(influnce_clf, 'influnce_clf.joblib') 
In [159]:
features = transaction.columns.drop(['amount','influenced'])
In [160]:
X = transaction[features]
y = transaction['amount']
In [161]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [162]:
transformer = make_union(StandardScaler())
In [163]:
clf = RandomForestRegressor()

pipeline = Pipeline([
    ('transformer',transformer),
    ('classifier',clf)
])

parameters = [
    {
        "classifier__n_estimators": range(10,110,10)
    },
    {
        "classifier": [AdaBoostRegressor()],
        "classifier__n_estimators": range(10,110,10),
        "classifier__learning_rate":np.linspace(0.1,2.5,20)
    },
    {
        "classifier": [GradientBoostingRegressor()],
        "classifier__n_estimators": range(10,110,10),
        "classifier__learning_rate":np.linspace(0.1,2.5,20)
    }
]
In [164]:
clf = GradientBoostingRegressor()

pipeline = Pipeline([
    ('transformer',transformer),
    ('classifier',clf)
])

parameters = [
    {
        "classifier__n_estimators": range(90,130,10),
        "classifier__learning_rate":[0.1]
    }
]
In [165]:
scoring = make_scorer(r2_score)

gridSearch = GridSearchCV(pipeline,
                          parameters,
                          verbose=2,
                          n_jobs = 6,
#                          n_jobs = -1,
                          cv = 5,
                          scoring=scoring,
#                          refit='F1',
                          return_train_score=True)
In [166]:
%%time
amount_clf = gridSearch.fit(X_train,y_train)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] classifier__learning_rate=0.1, classifier__n_estimators=90 ......
[CV] classifier__learning_rate=0.1, classifier__n_estimators=90 ......
[CV] classifier__learning_rate=0.1, classifier__n_estimators=90 ......
[CV] classifier__learning_rate=0.1, classifier__n_estimators=90 ......
[CV] classifier__learning_rate=0.1, classifier__n_estimators=90 ......
[CV] classifier__learning_rate=0.1, classifier__n_estimators=100 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=90, total=   9.1s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=100 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=90, total=   9.2s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=100 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=90, total=   9.3s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=100 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=90, total=   9.3s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=100 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=90, total=   9.4s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=110 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=100, total=  10.3s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=110 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=100, total=  10.4s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=110 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=100, total=  10.6s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=110 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=100, total=  10.6s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=110 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=100, total=  10.6s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=120 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=110, total=  11.5s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=120 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=110, total=  11.7s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=120 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=110, total=  11.4s
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=110, total=  11.4s
[CV] classifier__learning_rate=0.1, classifier__n_estimators=120 .....
[CV] classifier__learning_rate=0.1, classifier__n_estimators=120 .....
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=110, total=  11.4s
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=120, total=  12.2s
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=120, total=  11.5s
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=120, total=  11.0s
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=120, total=   5.0s
[CV]  classifier__learning_rate=0.1, classifier__n_estimators=120, total=   5.0s
[Parallel(n_jobs=6)]: Done  20 out of  20 | elapsed:   36.9s remaining:    0.0s
[Parallel(n_jobs=6)]: Done  20 out of  20 | elapsed:   36.9s finished
CPU times: user 5.35 s, sys: 211 ms, total: 5.56 s
Wall time: 42 s
In [167]:
y_pred = amount_clf.predict(X_test)
In [168]:
y_train_pred = amount_clf.predict(X_train)
In [169]:
r2_score(y_true=y_test,y_pred=y_pred), r2_score(y_true=y_train,y_pred=y_train_pred)
Out[169]:
(0.3281844040690566, 0.31659967566001)
In [170]:
mean_squared_error(y_true=y_test,y_pred=y_pred),mean_squared_error(y_true=y_train,y_pred=y_train_pred)
Out[170]:
(2275.5176108785217, 2312.9681967676943)
In [171]:
amount_clf
Out[171]:
GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('transformer', FeatureUnion(n_jobs=1,
       transformer_list=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))],
       transformer_weights=None)), ('classifier', GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate...s=100, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=6,
       param_grid=[{'classifier__n_estimators': range(90, 130, 10), 'classifier__learning_rate': [0.1]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=make_scorer(r2_score), verbose=2)
In [172]:
amount_clf.best_estimator_
Out[172]:
Pipeline(memory=None,
     steps=[('transformer', FeatureUnion(n_jobs=1,
       transformer_list=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))],
       transformer_weights=None)), ('classifier', GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate...s=120, presort='auto', random_state=None,
             subsample=1.0, verbose=0, warm_start=False))])
In [173]:
amount_clf_model = amount_clf.best_estimator_.get_params('classifier')['classifier']
In [174]:
amount_clf_model.feature_importances_
Out[174]:
array([0.09028357, 0.18890856, 0.0067576 , 0.01928344, 0.00973714,
       0.14015558, 0.09350301, 0.07693956, 0.0050788 , 0.07746227,
       0.        , 0.        , 0.01853818, 0.09378324, 0.01244317,
       0.00336117, 0.        , 0.01390844, 0.1283111 , 0.00150472,
       0.        , 0.        , 0.        , 0.00758537, 0.        ,
       0.00555205, 0.        , 0.00584507, 0.00105798])
In [175]:
df = pd.DataFrame(list(zip(X.columns,amount_clf_model.feature_importances_)),columns=['Feature','Importance'])
In [176]:
trace0 = go.Bar(x=df.Feature,
                y=df.Importance,
                name = 'Offer Received',
                marker=dict(color='rgba(0,107,164,1)',))


layout = go.Layout(
    title = 'Feature Importance for Model to predict Amount',
    xaxis=dict(tickangle=-45),
    barmode='group',
    bargap=0.15
)

fig = go.Figure(data=go.Data([trace0]),
                layout=layout)

iplot(fig)
In [177]:
##Uncomment if needed
#dump(amount_clf, 'amount_clf.joblib') 
In [178]:
transaction_for_offer = generate_transaction_without_dummies(transcript_clean, profile_for_ml, portfolio_for_ml)
In [179]:
features = transaction_for_offer.columns.drop(['difficulty', 'duration', 'bogo', 'discount','informational', 'email', 
                                               'mobile','social', 'web', 'reward','offer_code'])
In [180]:
X = transaction_for_offer[features]
y = transaction_for_offer['offer_code']
In [181]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [182]:
clf = RandomForestClassifier()

pipeline = Pipeline([
    ('classifier',clf)
])

parameters = [
    {
        "classifier__n_estimators": range(10,110,10)
    },
    {
        "classifier": [AdaBoostClassifier()],
        "classifier__n_estimators": range(10,110,10)
    },
    {
        "classifier": [ExtraTreesClassifier()],
        "classifier__n_estimators": range(10,110,10)
    },
    {
        "classifier": [GradientBoostingRegressor()],
        "classifier__n_estimators": range(10,110,10),
        "classifier__learning_rate":np.linspace(0.1,2.5,20)
    }
]
In [183]:
clf = AdaBoostClassifier()

pipeline = Pipeline([
    ('classifier',clf)
])

parameters = [
    {
        "classifier": [AdaBoostClassifier()],
        "classifier__n_estimators": [10]
    }
]
In [184]:
#scoring = make_scorer(f1_score)

# Change n_jobs to -1 to use all available CPU cores.
gridSearch = GridSearchCV(pipeline,
                          parameters,
                          verbose=2,
                          n_jobs = 6,
#                          n_jobs = -1,
                          cv = 5,
#                          scoring=scoring,
                          return_train_score=True)
In [185]:
%%time
offer_code_clf = gridSearch.fit(X_train,y_train)
Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10 
[CV] classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10 
[CV] classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10 
[CV] classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10 
[CV] classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10 
[CV]  classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10, total=   0.6s
[CV]  classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10, total=   0.7s
[CV]  classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10, total=   0.7s
[CV]  classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10, total=   0.7s
[CV]  classifier=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=50, random_state=None), classifier__n_estimators=10, total=   0.7s
[Parallel(n_jobs=6)]: Done   3 out of   5 | elapsed:    1.1s remaining:    0.7s
[Parallel(n_jobs=6)]: Done   5 out of   5 | elapsed:    1.1s finished
CPU times: user 513 ms, sys: 136 ms, total: 649 ms
Wall time: 1.64 s
In [186]:
y_pred = offer_code_clf.predict(X_test)
In [187]:
y_train_pred = offer_code_clf.predict(X_train)
In [188]:
print(classification_report(y_true=y_train,y_pred= y_train_pred))
             precision    recall  f1-score   support

          0       0.50      1.00      0.66     11198
          1       0.00      0.00      0.00      4430
          2       0.00      0.00      0.00      4418
          3       0.18      0.50      0.27      4407
          4       0.00      0.00      0.00      4439
          5       0.12      0.11      0.12      4443
          6       1.00      0.00      0.00      4360
          7       0.17      0.21      0.19      4464
          8       0.13      0.28      0.18      4388
          9       0.19      0.07      0.11      4299
         10       0.00      0.00      0.00      4409

avg / total       0.24      0.30      0.20     55255

In [189]:
print(classification_report(y_true=y_test,y_pred= y_pred))
             precision    recall  f1-score   support

          0       0.50      1.00      0.66      2849
          1       0.00      0.00      0.00      1121
          2       0.00      0.00      0.00      1072
          3       0.18      0.47      0.26      1093
          4       0.00      0.00      0.00      1093
          5       0.11      0.11      0.11      1102
          6       1.00      0.00      0.00      1112
          7       0.16      0.21      0.18      1080
          8       0.13      0.27      0.17      1103
          9       0.21      0.07      0.11      1137
         10       0.00      0.00      0.00      1052

avg / total       0.25      0.30      0.20     13814

In [190]:
accuracy_score(y_true=y_train,y_pred= y_train_pred), accuracy_score(y_true=y_test,y_pred= y_pred)
Out[190]:
(0.29577413808705094, 0.2951353699145794)
In [191]:
offer_code_clf.best_estimator_
Out[191]:
Pipeline(memory=None,
     steps=[('classifier', AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
          learning_rate=1.0, n_estimators=10, random_state=None))])
In [192]:
offer_code_clf_model = offer_code_clf.best_estimator_.get_params('classifier')['classifier']
In [193]:
offer_code_clf_model.feature_importances_
Out[193]:
array([0.7, 0.1, 0. , 0.1, 0. , 0. , 0. , 0.1, 0. , 0. ])
In [194]:
df = pd.DataFrame(list(zip(X.columns,offer_code_clf_model.feature_importances_)),columns=['Feature','Importance'])
In [195]:
trace0 = go.Bar(x=df.Feature,
                y=df.Importance,
                name = 'Offer Received',
                marker=dict(color='rgba(0,107,164,1)',))


layout = go.Layout(
    title = 'Feature Importance for Model to predict Offer Code',
    xaxis=dict(tickangle=-45),
    barmode='group',
    bargap=0.15
)

fig = go.Figure(data=go.Data([trace0]),
                layout=layout)

iplot(fig)
In [196]:
#Uncomment if needed
#dump(offer_code_clf, 'offer_code_clf.joblib') 

7. Evaluation

Once one or more models appear to be of high quality based on their loss functions, they need to be tested to ensure they generalize to unseen data and that all critical business issues have been addressed. The result is the selection of the champion model(s).

When we first started working on the data, we came up with three potential models. After training, only one of them proved useful.

  • The model that predicts whether an individual is influenced by a promotion depends heavily on the amount spent, information that is only available after the action (the purchase) has already taken place. Therefore it is not useful in practice.

  • The model that predicts the best offer for an individual yields very low scores on the testing data and only decent scores on the training data (tending towards overfitting / high variance). For this reason the model is discarded.

  • The model that predicts the purchasing habits of individuals yields similar scores on the training and testing data (a good trade-off between bias and variance). We use this model in the web application to make predictions.
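
The train-versus-test comparison used in these bullets can be sketched as a quick diagnostic: fit a model, score both splits, and look at the gap. The data below is synthetic (via `make_regression`) and stands in for the transaction features; the model settings mirror the grid-search winner above, but the numbers are illustrative only.

```python
# Sketch: diagnosing bias/variance by comparing train vs. test R^2.
# Synthetic data stands in for the real transaction features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(n_estimators=120, learning_rate=0.1,
                                  random_state=42)
model.fit(X_train, y_train)

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
gap = train_r2 - test_r2

# A large gap suggests high variance (overfitting); two similarly low
# scores suggest high bias (underfitting).
print(f"train R2={train_r2:.3f}, test R2={test_r2:.3f}, gap={gap:.3f}")
```

A small gap with comparable scores on both splits (as with the amount model, roughly 0.32/0.33 above) is what we treat as evidence of generalization.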

There are a few takeaways:

  1. Contrary to conventional software development, it is not always feasible to fully realize the business requirements. Sometimes more data would help, but in our case we had to work within the data provided.
  2. Evaluation on the test set was part of the modeling phase, which is not common in the real world. Since there are no dedicated testers for this project, modeling and evaluation were done simultaneously. At no point was the testing data exposed to the models during training.

8. Deployment

Generally, this will mean deploying a code representation of the model into an operational system to score or categorize new unseen data as it arises, and to create a mechanism for using that new information in the solution of the original business problem. Importantly, the code representation must also include all the data preparation steps leading up to modeling, so that the model treats new raw data in the same manner as during model development.

I have created a web application that utilizes the data analysis and the pretrained models. A complete description of the steps needed to launch the web application can be found in the README.MD file.
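
As a minimal sketch of the deployment idea above (persisting the whole pipeline so the scaling step travels with the model), assuming joblib is used as in the commented `dump(...)` cells; the file name and toy data here are illustrative, not the project's real artifacts:

```python
# Sketch: persisting and reloading a full Pipeline with joblib, so the
# StandardScaler travels with the model. File name is illustrative.
import numpy as np
from joblib import dump, load
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
X, y = rng.rand(200, 3), rng.rand(200)

pipe = Pipeline([('transformer', StandardScaler()),
                 ('classifier', GradientBoostingRegressor(random_state=42))])
pipe.fit(X, y)
dump(pipe, 'amount_pipeline.joblib')

# At serving time the web app only needs the .joblib artifact:
restored = load('amount_pipeline.joblib')
pred = restored.predict(X[:5])
```

Because the scaler sits inside the persisted `Pipeline`, the web app never has to reimplement the preprocessing; raw feature rows are transformed exactly as they were during training.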

Screenshots from the web application are shown below.

Overview of the Dataset

Path: http://0.0.0.0:3001/

Predict Amount

Path: http://0.0.0.0:3001/predict_amt

9. Footnotes

If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.
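
For reference, a minimal sketch of the kind of read that needs the newer pandas: transcript.json is line-delimited JSON, so it would be read with `lines=True`. The two sample records below are fabricated for illustration, and the real file's fields may differ.

```python
# Sketch: reading a line-delimited JSON file with an up-to-date pandas.
# The sample records are illustrative, not from the real transcript.json.
import pandas as pd

sample = ('{"person": "a", "event": "offer received", "time": 0}\n'
          '{"person": "b", "event": "transaction", "time": 6}\n')
with open('transcript_sample.json', 'w') as f:
    f.write(sample)

transcript = pd.read_json('transcript_sample.json', orient='records', lines=True)
```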

You can see how to access the terminal and how the install works using the two images below. First you need to access the terminal:

Then you will want to run the above command:

Finally, when you return to the notebook (use the Jupyter icon again), you should be able to run the cell below without any errors.